index.html

<!doctype html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <title>MediaStreamTrack Content Hints</title>
  <script src="https://www.w3.org/Tools/respec/respec-w3c" class="remove"></script>
  <script class='remove'>
  "use strict";
  // See https://github.com/w3c/respec/wiki/ for how to configure ReSpec
  var respecConfig = {
    "doRDFa": false,
    "githubAPI": "https://api.github.com/repos/w3c/mst-content-hint",
    "issueBase": "https://www.github.com/w3c/mst-content-hint/issues/",
    "noLegacyStyle": true,
    "editors": [
      {   name: "Harald Alvestrand",  company: "Google", w3cid: "24610" }
    ],
    "formerEditors": [{
        name: "Peter Boström",
        company: "Google",
      },
      // Add additional editors here.
      // https://github.com/w3c/respec/wiki/editors
    ],
    "shortName": "mst-content-hint",
    "edDraftURI": "https://w3c.github.io/mst-content-hint/",
    "subjectPrefix": "[mst-content-hint]",
    "otherLinks": [{
      "key": "Repository",
      "data": [{
        "value": "We are on GitHub.",
        "href": "https://github.com/w3c/mst-content-hint"
      }, {
        "value": "File a bug.",
        "href": "https://github.com/w3c/mst-content-hint/issues"
      }, {
        "value": "Commit history.",
        "href": "https://github.com/w3c/mst-content-hint/commits/gh-pages"
      }, ]
    }],
      wg:           "Web Real-Time Communications Working Group",

      // URI of the public WG page
      wgURI:        "https://www.w3.org/2011/04/webrtc/",

      // name (without the @w3c.org) of the public mailing to which comments are due
      wgPublicList: "public-media-capture",
      wgPatentURI:  "https://www.w3.org/2004/01/pp-impl/47318/status",
      specStatus:           "ED",
      // until we separate out the test suite for this spec:
      testSuiteURI: "https://github.com/web-platform-tests/wpt/tree/master/webrtc/",
  };
  </script>
</head>

<body>
  <section id="abstract">
    <p>
      This specification extends <a>MediaStreamTrack</a> to provide a
      media-content hint attribute. This optional hint permits
      <a>MediaStreamTrack</a> consumers such as <dfn>PeerConnection</dfn>
      (defined in [[webrtc]]) or <dfn>MediaRecorder</dfn> (defined in
      [[mediastream-recording]]) to encode or process track media with methods
      more appropriate to the type of content that is being consumed.
    </p>
    <p>
      Adding a media-content hint provides a way for a web application to help
      track consumers make more informed decision of what encoder parameters and
      processing algorithms to use on the consumed content.
    </p>
  </section>
  <section id="sotd">
  </section>
  <section id="introduction">
    <h2>Introduction</h2>
    <p>
      Algorithms used for processing speech and music differ greatly.
      Echo-cancellation algorithms developed for speech-type content might not
      work well on music, and noise-suppression algorithms might remove drum
      snares or other "noisy" content. While this makes speech more intelligible
      it is less appropriate for music signals.
    </p>
    <p>
      For video, webcam content often require denoising and is often
      intelligible even when downscaled or with high quantization levels.
      Screencast content of presentations or webpages with a lot of text content
      is completely unintelligible if the quantization levels are too high or
      if the content is downscaled or otherwise blurry.
    </p>
    <p>
      Without automatic detection of media content, a <a>MediaStreamTrack</a>
      consumer can only make an educated guess. This guess may be based on
      assuming that screencast content, such as
      <code>chrome.desktopCapture</code>, contains text content and must use low
      quantization levels, and drop frames extensively to meet bitrate
      requirements. Another assumption is that regular USB video devices provide
      webcam video, and higher quantization levels and downscaling are
      acceptible.
    </p>
    <p>
      While usually appropriate this educated guess leads to sub-optimal
      settings when incorrect. This manifests as high framedropping when
      screencasting high-motion content such as a movie or streaming a video
      game and treating it as text. Treating highly-detailed content as regular
      webcam video on the other hand leads to too-blurry content when being
      either quantized or downscaled beyond readability to meet bitrate
      requirements. This mismatch may also happen when HDMI video-capture cards
      are seen as USB webcams but actually screencast webpage text.
    </p>
    <figure id="text-downscaling">
      <img src="lorem-ipsum.png" alt="Lost text intelligibility when downscaling."/>
      <figcaption>
        While downscaling can be done to preserve motion in low-bitrate
        scenarios, this example illustrates lost text intelligibility when
        incorrectly applied to detailed content. Example shows 100%, 50% and 25%
        cubic downscale corresponding to downscaling from HD to VGA and QVGA
        resolutions respectively.
      </figcaption>
    </figure>
    <p>
      In some cases the web application can make a more-educated guess or take
      user input to inform consumers of what kind of content is being encoded. A
      web application that streams video-game content would be able to preserve
      motion from desktop capture at the cost of individual frame detail. A
      music-studio application would be able to prevent noise suppression from
      removing snares from a music track.
    </p>
    <p>
      These settings are not intended to replace encoder-level settings
      completely but rather complement them with a simpler hint that does not
      require broad knowledge of video encoders, audio-processing steps or
      more extensive tuning.
    </p>
    <p>
      A separate section of this specification describes expected behavior
      for specific components that process a MediaStreamTrack.
    </p>
  </section>
  <section id="conformance">
  </section>
  <section id="terminology">
    <h2>Terminology</h2>
    <p> The terms <dfn data-cite="!WEBRTC#dom-rtcrtpsender">RTCRtpSender</dfn>
      and <dfn data-cite="!WEBRTC#dom-rtcrtpsendparameters">RTCRtpSendParameters</dfn>
      are defined in [[!WEBRTC]].
    </p>
  </section>
  <section data-link-for="MediaStreamTrack" id="mediastreamtrack-extension">
    <h2>Extension to MediaStreamTrack</h2>
    <pre class="idl">
    partial interface <span class="idlInterfaceID">MediaStreamTrack</span> {
      attribute DOMString contentHint;
    };
    </pre>
    <p>
      This specification extends <dfn>MediaStreamTrack</dfn> and makes use of
      its <dfn>kind</dfn> attribute as defined in [[GETUSERMEDIA]].
    </p>
    <p>
      Each <a>MediaStreamTrack</a> has an associated <dfn>application-set
      content hint</dfn>, which is initially <code>""</code>, signifying unset.
      This <a>application-set content hint</a> corresponds to the
      <dfn data-dfn-for="MediaStreamTrack">contentHint</dfn> attribute of <a>MediaStreamTrack</a> which may be
      used by the web application to provide a hint of what type of content is
      contained within the track, to guide how it should be treated by
      <a>MediaStreamTrack</a> consumers.
    </p>
    <p>
      Valid values for the application-set content hint are dependent on the
      <a>kind</a> of <a>MediaStreamTrack</a> contained. On setting
      <a>contentHint</a> to <i>value</i>,
    </p>
    <ol>
      <li>
        If this <a>MediaStreamTrack</a>'s <a>kind</a> attribute is
        <code>"audio"</code>, and <i>value</i> is not one of <code>""</code>,
        <code>"speech"</code>, <code>"speechRecognition"</code>, or <code>"music"</code>, abort these steps.
      </li>
      <li>
        If this <a>MediaStreamTrack</a>'s <a>kind</a> attribute is
        <code>"video"</code>, and <i>value</i> is not one of <code>""</code>,
        <code>"motion"</code>, <code>"detail"</code> or <code>"text"</code>,
        abort these steps.
      </li>
      <li>
        Set this <a>MediaStreamTrack</a>'s application-set content hint to
        <i>value</i>.
      </li>
      <li>
        The implementation should adapt its decision on how to handle the
        content of this <a>MediaStreamTrack</a> according to the new value of
        its application-set content hint. This adaptation should happen as
        quickly as reasonable, e.g. within the next couple of captured video
        frames or audio buffers.
      </li>
    </ol>
    <p>
      On getting <a>contentHint</a>,
    </p>
    <ol>
      <li>
        Return this <a>MediaStreamTrack</a>'s application-set content hint.
      </li>
    </ol>
    <p>
      <i>Note that the initial value of application-set content hints is
        <code>""</code>, corresponding to that no hint has been provided. It
      does not default to the implementation's best guess of contained type of
      content.</i>
    </p>
    <section>
      <h3>Audio Content Hints</h3>
      <p>
        Audio content hints are only applicable when the <a>MediaStreamTrack</a>
        contains an audio track.
      </p>
      <table class="simple"><tbody><tr><th colspan="2">Audio content hints</th></tr>
        <tr><td><code>""</code></td><td><p>
          No hint has been provided, the implementation should make its
          best-informed guess on how to handle contained audio data. This may be
          inferred from how the track was opened or by doing content analysis.
        <tr><td><code id="idl-def-AudioContentHint.speech">speech</code></td><td><p>
          The track should be treated as if it contains speech data. Consuming
          this signal it may be appropriate to apply noise suppression or boost
          intelligibility of the incoming signal.
        </p></td></tr>
        <tr><td><code id="idl-def-AudioContentHint.speechRecognition">speechRecognition</code></td><td><p>
          The track should be treated as if it contains data for the purpose of
          speech recognition by a machine. Consuming this signal it may be
          appropriate to boost intelligibility of the incoming signal for
          transcription and turn off audio-processing components that are used for
          human consumption.
        </p></td></tr>
        <tr><td><code id="idl-def-AudioContentHint.music">music</code></td><td><p>
          The track should be treated as if it contains music data. Generally this
          might imply tuning or turning off audio-processing components that are
          used to process speech data to prevent the audio from being distorted.
        </p></td></tr>
      </tbody></table>
    </section>
    <section>
      <h3>Video Content Hints</h3>
      <p>
        Video content hints are only applicable when the <a>MediaStreamTrack</a>
        contains a video track.
      </p>
      <table class="simple"><tbody><tr><th colspan="2">Video content hints</th></tr>
        <tr><td><code>""</code></td><td><p>
          No hint has been provided, the implementation should make its
          best-informed guess on how contained video content should be treated.
          This can for example be inferred from how the track was opened or by
          doing content analysis.
        <tr><td><code id="idl-def-VideoContentHint.motion">motion</code></td><td><p>
          The track should be treated as if it contains video where motion is
          important. This is normally webcam video, movies or video games.
          Quantization artefacts and downscaling are acceptible in order to
          preserve motion as well as possible while still retaining target
          bitrates. During low bitrates when compromises have to be made, more
          effort is spent on preserving frame rate than edge quality and details.
        </p></td></tr>
        <tr><td><code id="idl-def-VideoContentHint.detail">detail</code></td><td><p>
          The track should be treated as if video details are extra important.
          This is generally applicable to presentations or web pages with text
          content, painting or line art. This setting would normally optimize for
          detail in the resulting individual frames rather than smooth playback.
          Artefacts from quantization or downscaling that make small text or line
          art unintelligible should be avoided.
        </p></td></tr>
        <tr><td><code id="idl-def-VideoContentHint.text">text</code></td><td><p>
          The track should be treated as if video details are extra important,
          and that significant sharp edges and areas of consistent color can
          occur frequently.
          This is generally applicable to presentations or web pages with text
          content. This setting would normally optimize for
          detail in the resulting individual frames rather than smooth
          playback, and may take advantage of encoder tools that optimize for
          text rendering.
          Artefacts from quantization or downscaling that make small text
          or line art unintelligible should be avoided.
        </p></td></tr>
      </tbody></table>
    </section>
  </section>
  <section>
    <h2>Behaviors of other components based on content-hint</h2>
    <section>
      <h3>Behavior of a MediaStreamTrack</h3>
      <p>When setting a contentHint value for a MediaStreamTrack, the
        UA MUST <a>apply a default</a> as follows:
      </p>
      <ul>
        <li>For an audio track with the value "music", and for constraints
          echoCancellation, autoGainControl and noiseSuppression
          apply a default of "false".</li>
        <li>For an audio track with the value "speech", and for constraints
          echoCancellation and autoGainControl, apply a default of
          "true".</li>
        <li>For an audio track with the value "speechRecognition", and for
          constraints echoCancellation, autoGainControl, and noiseSuppression,
          apply a default of "false".</li>
      </ul>
      <p>To <dfn>apply a default</dfn> to the constraint <var>c</var>
        with value <var>t</var>, perform the following steps:
        <ul>
          <li>If the value <var>t</var> satisfies the applied constraints,
            set the setting corresponding to c <var>t</var>.</li>
          <li>Otherwise, choose a value for the setting corresponding to c
            that satisfies the applied constraints.</li>
          <li>Remember the value <var>t</var>.</li>
          <li>In parallel, update the track with the new settings.</li>
        </ul>
      <p>Whenever the "apply constraints" algorithm is subsequently
        run, the UA MUST choose the remembered value <var>t</var> if it is
        now a permitted value.
      </p>
      <p>When setting a contentHint value of "", all remembered values of
        <var>t</var> are removed.
      </p>
    </section>
    <section>
      <h3>Degradation preference when encoding</h3>
      <p>
        When encoding video, and some constraint (bandwidth, CPU) prevents
        encoding at the configured framerate and resolution, the encoder
        must make a choice on how to modify the encoding parameters.
      </p>
      <p>
        This section defines terms to describe the choice, and an enum
        that can be used to indicate that choice in an API.
      </p>
      <div>
        <pre class="idl"
>enum RTCDegradationPreference {
  "maintain-framerate",
  "maintain-resolution",
  "balanced"
};</pre>
        <table data-link-for="RTCDegradationPreference" data-dfn-for=
        "RTCDegradationPreference" class="simple">
          <tbody>
            <tr>
              <th colspan="2"><code><a>RTCDegradationPreference</a></code> Enumeration description</th>
            </tr>
            <tr>
              <td data-tests="RTCRtpParameters-degradationPreference.html"><dfn data-idl><code>maintain-framerate</code></dfn></td>
              <td>
                <p>Degrade resolution in order to maintain framerate.</p>
              </td>
            </tr>
            <tr>
              <td><dfn data-idl><code>maintain-resolution</code></dfn></td>
              <td>
                <p>Degrade framerate in order to maintain resolution.</p>
              </td>
            </tr>
            <tr>
              <td data-tests="RTCRtpParameters-degradationPreference.html"><dfn data-idl><code>balanced</code></dfn></td>
              <td>
                <p>Degrade a balance of framerate and resolution.</p>
              </td>
            </tr>
          </tbody>
        </table>
      </div>
      An attribute is defined for RTCRtpSendParameters that allows this
      to be explicitly indicated for an RTCRtpSender:

      <pre class='idl'>partial dictionary RTCRtpSendParameters {
        RTCDegradationPreference degradationPreference;
       };</pre>
      <section>
        <h2>Dictionary <code><a>RTCRtpSendParameters</a></code> New Members</h2>
        <dl>
          <dt data-tests="RTCRtpParameters-degradationPreference.html"><dfn data-idl><code>degradationPreference</code></dfn> of type
            <span class="idlMemberType"><a>RTCDegradationPreference</a></span>.
          <dd>
            <p>When bandwidth is constrained and the
              <code><a>RTCRtpSender</a></code> needs to choose between degrading
              resolution or degrading framerate,
              <code>degradationPreference</code> indicates which is
              preferred.</p>
          </dd>
        </dl>
      </section>
    </section>
    <section>
      <h3>Behavior of an RTCPeerConnection</h3>
      <p>An <a>RTCRtpSender</a> transmitting a MediaStreamTrack for which a
        contentHint attribute has been
        set MUST use the following degradation preferences, unless an
        explicit degradationPreference attribute has been set in the sender's
        parameters:
      </p>
      <ul>
        <li>For a video track with the attribute value "motion",
          use "maintain-framerate".
        </li>
        <li>For a video track with the attribute value "detail",
          use "maintain-resolution".
        </li>
        <li>For a video track with the attribute value "text",
          use "maintain-resolution". In addition, if the
        encoding codec is AV1, activate encoding tools for "text" mode.
        </li>
      </ul>
    </section>
    <section>
      <h3>Behavior of a MediaStreamRecorder</h3>
      <p>For a video track with the attribute value "text", if the
        encoding codec is AV1, activate encoding tools for "text"
        mode.
      </p>
    </section>
  </section>
  <section>
    <h2>Security and Privacy Considerations</h2>
    <p>
      This specification adds an API where the user can send
      information to the user agent. This information is not
      communicated elsewhere, and is not stored in any permanent
      location.
    </p>
    <p>
      By probing this API, the client may get some information on how
      the user agent implements this specification, which may give some
      insight into its configuration. Apart from this, no user agent
      information or data is exposed to the client through this API,
      and the API does not give any access to browser UI, influence
      over the user agent's security characteristics or the generation
      of temporary identifiers.
    </p>
    <p>
      This API does not distinguish between first-party, third-party
      or incognito contexts; it behaves identically in all these
      cases.
    </p>
    <p>
      No personally identifiable or otherwise sensitive information is
      handled via the use of this API.
    </p>
    
  </section>
</body>

</html>