ISO/IEC 23090-2:2023/DAmd 1.2: Information technology — Coded representation of immersive media — Part 2: Omnidirectional media format — Amendment 1: Server-side dynamic adaptation

ISO/IEC 23090-2.2:2023/DAM 1:2025(en)

ISO/IEC JTC 1/SC 29

Secretariat: JISC

Date: 2025-08-30

Information technology — Coded representation of immersive media — Part 2: Omnidirectional media format

Amendment 1: Server-side dynamic adaptation

© ISO/IEC 2025

All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO's member body in the country of the requester.

ISO Copyright Office

CP 401 • CH-1214 Vernier, Geneva

Phone: + 41 22 749 01 11

Email: copyright@iso.org

Website: www.iso.org

Published in Switzerland.

Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work.

The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of document should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives or www.iec.ch/members_experts/refdocs).

ISO and IEC draw attention to the possibility that the implementation of this document may involve the use of (a) patent(s). ISO and IEC take no position concerning the evidence, validity or applicability of any claimed patent rights in respect thereof. As of the date of publication of this document, ISO and IEC had not received notice of (a) patent(s) which may be required to implement this document. However, implementers are cautioned that this may not represent the latest information, which may be obtained from the patent database available at www.iso.org/patents and https://patents.iec.ch. ISO and IEC shall not be held responsible for identifying any or all such patent rights.

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see www.iso.org/iso/foreword.html. In the IEC, see www.iec.ch/understanding-standards.

This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 29, Coding of audio, picture, multimedia and hypermedia information.

A list of all parts in the ISO/IEC 23090 series can be found on the ISO and IEC websites.

Any feedback or questions on this document should be directed to the user’s national standards body. A complete listing of these bodies can be found at www.iso.org/members.html and www.iec.ch/national-committees.

Information technology — Coded representation of immersive media — Part 2: Omnidirectional media format
AMENDMENT 1: Server-side dynamic adaptation

 

In 1 (Scope), add the following reference to Annex H at the end of the list:

— Annex H specifies an overall architecture and adaptation parameters to support server-side dynamic adaptation.

In 4.7.5.3, add the following entries at the end of Table 4:

podr | H.6.1.1 | An open-ended restricted scheme type for projected omnidirectional video with dynamic region-wise packing.

erdp | H.6.1.2 | Packed equirectangular or cubemap projected video with dynamic region-wise packing.

In 4.7.5.5, add the following table rows before the table row labeled as "sphere region sample entry":

pord | H.6.2 | projected omnidirectional video with dynamic region-wise packing

prfr | 7.6.2 | projection format box

rotn | 7.6.5 | rotation box

covi | 7.6.6 | coverage information box
In 4.7.5.8, add the following entry at the end of Table 9:

rwpk | H.4.1.1 | Sample grouping specifying the mapping between packed regions and the corresponding projected regions and specifying the location and size of the guard bands, if any.

 

In 7.6.5.1, replace:

Box Type: 'rotn'
Container: ProjectedOmniVideoBox or MeshOmniVideoBox
Mandatory: No
Quantity: Zero or one

 

with:

Box Type: 'rotn'
Container: ProjectedOmniVideoBox, MeshOmniVideoBox, or ProjectedOmniVideoDynamicPackingBox
Mandatory: No
Quantity: Zero or one

 

In 7.6.6.1, replace:

Box Type: 'covi'
Container: ProjectedOmniVideoBox or SpatialRelationship2DDescriptionBox
Mandatory: No
Quantity: Zero or one

 

with:

Box Type: 'covi'
Container: ProjectedOmniVideoBox, SpatialRelationship2DDescriptionBox, or ProjectedOmniVideoDynamicPackingBox
Mandatory: No
Quantity: Zero or one

 

Add a new annex (Annex H) after Annex G:

Annex H
(normative)

Server-side dynamic adaptation

H.1 General

This annex provides an overall architecture and adaptation parameters to support server-side dynamic adaptation. These parameters can be used as part of HTTP requests for OMAF-related media segments, as URL parameters specified in ISO/IEC 23009-1, in the form of:

— URL query parameters, and/or

— HTTP header parameters.

It is expected that, when the adaptation parameters are received at the dynamic adaptation server, OMAF content (e.g. in the form of streaming segments) will be dynamically selected and/or adapted according to those parameters and delivered back to an OMAF player.
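EXAMPLE The following non-normative sketch (in Python) illustrates one way an OMAF player could attach such adaptation parameters to a segment request, either as URL query parameters or in an HTTP header. The segment URL, the parameter values and the header serialization are hypothetical assumptions; the parameter and header names (scsw, scsh, bitr, SSDA-Request) are those defined in H.3.

# Non-normative sketch: attaching SSDA adaptation parameters to a segment request.
from urllib.parse import urlencode

segment_url = "https://example.com/omaf/rep1/segment_0042.m4s"  # hypothetical URL

params = {
    "scsw": 3840,   # screen width in pixels
    "scsh": 2160,   # screen height in pixels
    "bitr": 12000,  # requested bitrate in kbps
}

# Variant 1: URL query parameters
request_url = segment_url + "?" + urlencode(params)

# Variant 2: HTTP header parameters (one possible, assumed serialization)
headers = {"SSDA-Request": urlencode(params)}

print(request_url)
print(headers)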

H.2 Overall architecture

Figure H.1 shows a typical content flow process for an omnidirectional media application.

NOTE 1 In this content flow process, the dynamic adaptation service is a separate server process from the content generation process.

Figure H.1 — Content flow process for omnidirectional media with projected video

The following interfaces are specified in this document:

— E'a, E'v, E'i: audio bitstream, video bitstream, coded image(s), respectively; see Clause 10.

— E'v-s: video bitstream, see Clause 10.

— F/F': media file; see Clause 7. Moreover, media profiles specified in Clause 10 include the specification of the track formats for F/F', which may contain constraints on the elementary streams contained within the samples of the tracks.

— Clause 8 specifies the delivery related interfaces for DASH delivery.

— Clause 9 specifies the delivery related interfaces for MMT delivery.

The other interfaces in Figure H.1 are not specified in this document.

NOTE 2 While the syntax and semantics of the bitstreams Ea, Ev, and Ei are the same as those for E'a, E'v, E'i, respectively, the input interface to the file/segment encapsulation module is not specified.

NOTE 3 While the syntax and semantics of the bitstream Ev are the same as those of E'v-s, the input interface to the file/segment encapsulation module is not specified.

A real-world audio-visual scene (A) is captured by audio sensors as well as a set of cameras or a camera device with multiple lenses and sensors. The acquisition results in a set of digital image/video (Bi) and audio (Ba) signals. The cameras/lenses typically cover all directions around the centre point of the camera set or camera device, thus the name of 360-degree video.

Audio may be captured using many different microphone configurations and stored as several different content formats, including channel-based signals, static or dynamic (i.e. moving through the 3D scene) object signals, and scene-based signals (e.g. Higher Order Ambisonics). The channel-based signals typically conform to one of the loudspeaker layouts defined in ISO/IEC 23091-3. In an omnidirectional media application, the loudspeaker layout signals of the rendered immersive audio program are binauralized for presentation via headphones.

For audio, no stitching process is needed, since the captured signals are inherently immersive and omnidirectional.

This document specifies the following types of omnidirectional video and images, which differ in the architecture in the image pre-processing for encoding and in the image rendering processing blocks.

— Projected omnidirectional video/images:

— Image pre-processing for encoding: The images (Bi) of the same time instance are stitched, possibly rotated, and projected onto 2D picture coordinates using a mathematically specified projection format. Optionally, the resulting projected pictures may be further mapped region-wise onto a packed picture. Either projected pictures or packed pictures are subject to video or image encoding.

— Image rendering: Either regions of the decoded packed pictures (if region-wise packing has been applied) or the entire projected picture (otherwise) is mapped onto a rendering mesh suitable for the projection format in use.

— Fisheye omnidirectional video/images:

— Image pre-processing for encoding: Circular images (Bi) captured by fisheye lenses are arranged onto a 2D picture, which is then input to video or image encoding.

— Image rendering: The decoded circular images are stitched using the signalled fisheye-specific parameters.

— Mesh omnidirectional video:

— Image pre-processing for encoding: A 3D mesh consisting of mesh elements is generated, where mesh elements can be either parallelograms or regions of a sphere surface. The images (Bi) of the same time instance are stitched, possibly rotated, and projected onto the 3D mesh. Mesh elements are mapped onto rectangular regions of one or more 2D pictures, which are input to video encoding.

— Image rendering: Rectangular regions of the decoded 2D picture(s) are mapped to the 3D mesh, which is used directly as the rendering mesh.

Further details of the architecture for projected, fisheye, and mesh omnidirectional video/images are provided in subclauses 4.3, 4.4, and 4.5, respectively.

The pre-processed pictures (D) are encoded as coded images (Ei) or a coded video bitstream (Ev). The captured audio (Ba) is encoded as an audio bitstream (Ea). The coded images, video or audio are then composed into a media file for file playback (F) or a sequence of an initialization segment and media segments for streaming (Fs, Fa), according to a particular media container file format. In this document, the media container file format is the ISO Base Media File Format specified in ISO/IEC 14496-12. The file encapsulator also includes metadata into the file or the segments.

The segments Fa are delivered using a delivery mechanism to the dynamic adaptation service.

In the dynamic adaptation service, the dynamic adaptation parameters, such as bitrate (see H.3), are sent from the Strategy module in the OMAF player to the Adaptation engine module in the dynamic adaptation service, which determines which segments are to be received based on the dynamic adaptation parameters, as instructed by the Adaptation logic module. The received segments (Fad) carried in tracks determined by the delivery mechanism are identical to the segments (Fa) except when bitstream rewriting is needed. Viewport-dependent video may be carried in multiple tracks, which may be processed in the File/segment decapsulator to extract the coded video streams. The coded video streams are merged by bitstream rewriting into a single video bitstream (E'v-s). This single video bitstream (E'v-s) is composed into segments for streaming (Fad) in the File/segment encapsulation module, according to a particular media container file format. The segments Fad are delivered to an OMAF player.

NOTE 4 When the OMAF player uses the dynamic adaptation service, the OMAF player can perform a reduced process: it only needs to send the dynamic adaptation parameters, while the dynamic adaptation service handles the adaptation and bitstream merging functions.

NOTE 5 If the segments (Fad) are contained in a file and delivered to an OMAF player, that file contains only the adapted segments; it is not a file covering the entire content duration beyond those segments.
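EXAMPLE The following non-normative sketch illustrates the processing flow described above. All type and function names (Segment, adaptation_logic, rewrite_and_merge, serve_segment_request) are hypothetical placeholders and do not correspond to any interface specified in this document.

# Non-normative sketch of the dynamic adaptation service flow.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Segment:
    track_id: int
    data: bytes

def adaptation_logic(params: Dict[str, str], candidates: List[Segment]) -> List[Segment]:
    # Hypothetical placeholder: apply the rules configured by the Adaptation logic
    # module (e.g. match 'bitr' or viewport parameters) to pick segments from Fa.
    return candidates

def rewrite_and_merge(segments: List[Segment]) -> bytes:
    # Placeholder for the bitstream rewriting step that merges the coded video
    # streams of viewport-dependent tracks into a single bitstream (E'v-s).
    return b"".join(s.data for s in segments)

def serve_segment_request(params: Dict[str, str], candidates: List[Segment]) -> bytes:
    selected = adaptation_logic(params, candidates)
    if len(selected) > 1:                     # viewport-dependent video in multiple tracks
        return rewrite_and_merge(selected)    # then re-encapsulated as segments Fad
    return selected[0].data                   # otherwise Fad is identical to Fa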

The segments Fs are delivered using a delivery mechanism to an OMAF player.

The file that the file encapsulator outputs (F) is identical to the file that the file decapsulator inputs (F'). A file decapsulator processes the file (F') or the received segments (F's and F'ad) and extracts the coded bitstreams (E'a, E'v, or E'i) and parses the metadata. Viewport-dependent video may be carried in multiple tracks, which may be merged by bitstream rewriting into a single video bitstream E'v prior to decoding. The audio, video or images are then decoded into decoded signals (B'a for audio, and D' for images/video). In the image rendering block, the decoded pictures (D') are projected onto the screen of a head-mounted display or any other display device based on the metadata parsed from the file. Likewise, decoded audio (B'a) is rendered, e.g. through headphones, according to the current viewing orientation. The current viewing orientation is determined by the viewing orientation tracking functionality. When a head-mounted display is in use, the viewing orientation tracking can involve head tracking and possibly also eye tracking. When sphere-relative overlays are in use, the viewing orientation tracking functionality can include or be complemented by viewing position tracking and rendering of overlays with background visual media can take both the viewing position and the viewing orientation into account. Besides being used by the renderer to render the appropriate part of decoded video and audio signals, the current viewing orientation may also be used by the video and audio decoders for decoding optimization. In viewport-dependent delivery, the current viewing orientation is also passed to the strategy module in the OMAF player, which determines the video tracks to be received based on the viewing orientation.

The process described above is applicable to both live and on-demand use cases.

H.3 Signalling of server-side dynamic adaptation information

Server-side dynamic adaptation (SSDA), complementary to client-side dynamic adaptation (CSDA), means that some dynamic adaptation can be performed at the server side instead of at the client side, as shown in Figure H.2.

Figure H.2 — Example streaming system using server-side dynamic adaptation

It should be noted that in an SSDA scheme, the Streaming Client can still make some static selections (such as those related to video codec profile, screen size and encryption algorithm) and leave only dynamic adaptation to the server, by collecting the dynamic adaptation parameters needed by the Adaptation Logic and passing them to the server as part of (HTTP) segment requests. Moreover, in a hybrid mode, SSDA and CSDA schemes can be implemented jointly, with the dynamic adaptation tasks split between the Streaming Client and the Server.

The segment URL parametrization scheme identified by URN "urn:mpeg:dash:urlparam:2014" or extended parametrization scheme identified by URN "urn:mpeg:dash:urlparam:2016" may be present at Adaptation Set level or at the Representation level.

In addition, an EssentialProperty element with a @schemeIdUri attribute equal to "urn:mpeg:mpegI:omaf:2022:serverSideDynamicAdaptation" is referred to as a server-side dynamic adaptation (SSDA) descriptor. When an SSDA descriptor exists in an AdaptationSet or Representation element, the OMAF player making an SSDA request shall add the query parameter(s) to the segment file URL or, alternatively, convey them in the HTTP header.

In addition, Adaptation Sets and Representations that have an SSDA descriptor shall either have or inherit the following DASH profile in the @profiles DASH attribute: urn:mpeg:mpegI:omaf:dash:profile:serverSideDynamicAdaptation.
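EXAMPLE The following non-normative sketch illustrates how a client could detect the SSDA descriptor in an MPD and decide to attach SSDA parameters to its segment requests. The MPD snippet is a hypothetical example; the schemeIdUri and @profiles values are those specified above.

# Non-normative sketch: detecting the SSDA descriptor in a DASH MPD.
import xml.etree.ElementTree as ET

SSDA_SCHEME = "urn:mpeg:mpegI:omaf:2022:serverSideDynamicAdaptation"

mpd_xml = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet profiles="urn:mpeg:mpegI:omaf:dash:profile:serverSideDynamicAdaptation">
      <EssentialProperty schemeIdUri="urn:mpeg:mpegI:omaf:2022:serverSideDynamicAdaptation"/>
      <Representation id="rep1" bandwidth="12000000"/>
    </AdaptationSet>
  </Period>
</MPD>"""

ns = {"dash": "urn:mpeg:dash:schema:mpd:2011"}
root = ET.fromstring(mpd_xml)
for aset in root.iter("{urn:mpeg:dash:schema:mpd:2011}AdaptationSet"):
    has_ssda = any(
        ep.get("schemeIdUri") == SSDA_SCHEME
        for ep in aset.findall("dash:EssentialProperty", ns)
    )
    if has_ssda:
        print("AdaptationSet uses SSDA; add query parameters or an SSDA-Request header")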

Table H.1 provides a list of parameters for the purpose of track selection or switching.

Table H.1 — Parameters for track selection or switching

Query parameter name | Header name | Query parameter value definition
scsw | SSDA-Request | Indicates the width of the screen in units of pixels.
scsh | SSDA-Request | Indicates the height of the screen in units of pixels.
bitr | SSDA-Request | Indicates the bitrate (kbps). The value is calculated as the total bit count of the samples in the track divided by (the total duration of the samples of the segment * 1000).
frar | SSDA-Request | Indicates the frame rate (fps). The value is calculated as the number of samples in the segment divided by the total duration of the samples of the segment.
nvws | SSDA-Request | Indicates the number of views in the track. When not present, the default value is 1.

With these parameters included in DASH HTTP requests for segments related to track selection and switching, the server is expected to return an HTTP response containing a segment from a track matching the requirements expressed by the parameters, selected or produced with server-side dynamic adaptation.

The value of track_ID shall remain unchanged in all HTTP responses in successive media segment requests of the same Representation regardless of which query parameter values are used.

Values of parameters scsw (width), scsh (height) and nvws (number of views) in an initial request shall remain unchanged in successive media segment requests of the same Adaptation Set, but the values of other query parameters may change dynamically to affect the selection.
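EXAMPLE The following non-normative sketch applies the bitr and frar derivations of Table H.1 to hypothetical per-sample sizes and durations for the samples of one segment.

# Non-normative worked example of the bitr and frar derivations in Table H.1.
sample_sizes_bytes = [45000, 52000, 48000, 47000]   # hypothetical sample sizes (bytes)
sample_durations_s = [1/30, 1/30, 1/30, 1/30]       # hypothetical sample durations (s)

total_bits = sum(size * 8 for size in sample_sizes_bytes)
total_duration_s = sum(sample_durations_s)

bitr = total_bits / (total_duration_s * 1000)        # kbps, per Table H.1
frar = len(sample_sizes_bytes) / total_duration_s    # fps, per Table H.1

print(f"bitr={bitr:.0f} kbps, frar={frar:.1f} fps")  # bitr=11520 kbps, frar=30.0 fps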

Table H.2 provides a list of parameters for the purpose of viewport related adaptation.

Table H.2 — Parameters for viewport related adaptation

Query parameter name | Header name | Query parameter value definition
azim | SSDA-Request | Specifies the azimuth of the centre point of the sphere region in units of 2^−16 degrees. When not present, the default value is 0.
elev | SSDA-Request | Specifies the elevation of the centre point of the sphere region in units of 2^−16 degrees. When not present, the default value is 0.
tilt | SSDA-Request | Specifies the tilt angle of the sphere region, in units of 2^−16 degrees. When not present, the default value is 0.
azrg | SSDA-Request | Specifies the azimuth range of the sphere region through the centre point of the sphere region in units of 2^−16 degrees. This parameter shall be present in any request related to a viewport.
elrg | SSDA-Request | Specifies the elevation range of the sphere region through the centre point of the sphere region in units of 2^−16 degrees. This parameter shall be present in any request related to a viewport.
styp | SSDA-Request | Specifies the shape type of the sphere region, as specified in 7.7.2.3. This parameter shall be present in any request related to a viewport.

With these parameters included in DASH HTTP requests for segments related to a viewport, the server is expected to return an HTTP response containing:

— a viewport segment, selected or produced with server-side dynamic adaptation. Here, the viewport segment covers the viewport with at least the same quality as the background, and the viewport segment may include some margin around the viewport.
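EXAMPLE The following non-normative sketch expresses a hypothetical viewport request using the parameters of Table H.2, converting values given in degrees into units of 2^−16 degrees.

# Non-normative sketch: forming viewport-related SSDA parameters.
from urllib.parse import urlencode

def to_units(degrees: float) -> int:
    """Convert degrees to units of 2^-16 degrees."""
    return round(degrees * (1 << 16))

viewport_params = {
    "azim": to_units(45.0),    # centre azimuth (hypothetical value)
    "elev": to_units(-10.0),   # centre elevation (hypothetical value)
    "tilt": to_units(0.0),
    "azrg": to_units(90.0),    # azimuth range (mandatory for viewport requests)
    "elrg": to_units(60.0),    # elevation range (mandatory for viewport requests)
    "styp": 0,                 # sphere region shape type, as specified in 7.7.2.3
}

print(urlencode(viewport_params))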

H.4 Extensions to the ISOBMFF for server-side dynamic adaptation

H.4.1 Region-wise packing sample group

H.4.1.1 Definition

The 'rwpk' grouping_type for sample grouping specifies the mapping between packed regions and the corresponding projected regions and specifies the location and size of the guard bands, if any.

If the region-wise packing sample group is present, the region-wise packing box shall not be present.

H.4.1.2 Syntax

class RegionWisePackingEntry() extends VisualSampleGroupEntry('rwpk') {
    RegionWisePackingStruct() region_wise_packing_struct;
}

H.4.1.3 Semantics

Subclause 7.5.3 applies with the following additional constraint:

— packed_picture_width and packed_picture_height shall have such values that packed_picture_width is an integer multiple of width and packed_picture_height is an integer multiple of height, where width and height are syntax elements of the VisualSampleEntry of the track containing this sample group description entry.
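EXAMPLE The following non-normative sketch checks the constraint above for hypothetical values of the VisualSampleEntry width and height and the RegionWisePackingStruct picture dimensions.

# Non-normative check of the integer-multiple constraint above.
def rwpk_entry_constraint_ok(packed_picture_width: int, packed_picture_height: int,
                             width: int, height: int) -> bool:
    return (packed_picture_width % width == 0) and (packed_picture_height % height == 0)

# Hypothetical example: a 3840x2160 packed picture, VisualSampleEntry of 1920x1080
print(rwpk_entry_constraint_ok(3840, 2160, 1920, 1080))  # True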

H.5 Segment format for server-side dynamic adaptation

When any sample of a media segment is mapped to the 'rwpk' sample group, all samples of the media segment shall be mapped to the same 'rwpk' sample group description entry.

H.6 Restricted video schemes for omnidirectional video for server-side dynamic adaptation

H.6.1 Scheme types

H.6.1.1 Projected omnidirectional video with dynamic region-wise packing ('podr')

The use of the projected omnidirectional video with dynamic region-wise packing scheme for the restricted video sample entry type 'resv' indicates that the decoded pictures are packed pictures containing either monoscopic or stereoscopic content. The use of this scheme is indicated by scheme_type equal to 'podr' (projected omnidirectional video with dynamic region-wise packing) within the SchemeTypeBox in the RestrictedSchemeInfoBox.

The format of the projected monoscopic pictures is indicated with the ProjectedOmniVideoDynamicPackingBox contained within the SchemeInformationBox. One and only one ProjectedOmniVideoDynamicPackingBox shall be present in the SchemeInformationBox when the scheme type is 'podr'.

The 'podr' scheme type is defined as an open-ended scheme type for projected omnidirectional video with dynamic region-wise packing.

As specified in subclause H.6.2, a ProjectionFormatBox shall be present within the ProjectedOmniVideoDynamicPackingBox. ProjectionFormatBox is not constrained beyond the specification in subclause H.6.2. The 'podr' scheme type may be used with any version value of ProjectionFormatBox. The 'podr' scheme type may be used with any projection_type value.

When the ProjectedOmniVideoDynamicPackingBox is present in the SchemeInformationBox, StereoVideoBox may be present in the same SchemeInformationBox.

For stereoscopic video, the frame packing arrangement of the projected left and right pictures is indicated with the StereoVideoBox contained within the SchemeInformationBox. The absence of StereoVideoBox indicates that the omnidirectionally projected content of the track is monoscopic. When StereoVideoBox is present in the SchemeInformationBox for the omnidirectional video scheme, version shall be equal to 0, stereo_scheme shall be equal to 4, and the first byte of stereo_indication_type shall be equal to 3, 4, or 5, indicating that side-by-side frame packing, top-bottom frame packing, or temporal interleaving of alternating first and second constituent frames, respectively, is in use, and the second byte of stereo_indication_type shall be equal to 0, indicating that quincunx sampling is not in use.

NOTE The 'stvi' scheme type is not expected to be used when the 'podr' scheme type is used.

Optional dynamic region-wise packing is indicated with the 'rwpk' sample group. The absence of the 'rwpk' sample group indicates that no region-wise packing is applied, i.e. that the packed picture is identical to the projected picture.

The 'rwpk' sample group is not constrained beyond the specification in subclause H.4. The 'podr' scheme type may be used with any values of the syntax elements of the 'rwpk' sample group.

In addition to the boxes constrained above, SchemeInformationBox may directly or indirectly contain other boxes. Those boxes are not constrained beyond their definition, syntax, and semantics.

H.6.1.2 Packed equirectangular or cubemap projected video with dynamic region-wise packing ('erdp')

NOTE This scheme type can be used for specifying media profiles.

The 'erdp' scheme type is defined as a closed scheme type for projected omnidirectional video.

When scheme_type is equal to 'erdp' in an instance of CompatibleSchemeTypeBox in the RestrictedSchemeInfoBox, the track conforms to the constraints of scheme_type equal to 'podr', scheme_type equal to 'podr' shall be present in SchemeTypeBox in the RestrictedSchemeInfoBox, and all of the following additional constraints apply:

— ProjectionFormatBox within the ProjectedOmniVideoDynamicPackingBox shall indicate either the equirectangular projection or the cubemap projection.

— When 'rwpk' sample group is present, the value of packing_type[i] for each value of i shall be equal to 0.

— version of ProjectionFormatBox, StereoVideoBox (when present), RotationBox (when present), and CoverageInformationBox (when present) shall be equal to 0.

— SchemeInformationBox shall not directly or indirectly contain any boxes other than ProjectedOmniVideoDynamicPackingBox, ProjectionFormatBox, StereoVideoBox, RotationBox, and CoverageInformationBox.
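EXAMPLE The following non-normative sketch checks the 'erdp' constraints above on a simplified, hypothetical representation of the track metadata; it assumes that projection_type equal to 0 indicates the equirectangular projection and 1 the cubemap projection, as specified in 7.6.2, and is not an actual file parser.

# Non-normative check of the 'erdp' constraints on hypothetical inputs.
def erdp_constraints_ok(projection_type: int,
                        rwpk_packing_types: list,
                        box_versions: dict,
                        contained_box_names: set) -> bool:
    # ProjectionFormatBox shall indicate the equirectangular (0) or cubemap (1) projection
    if projection_type not in (0, 1):
        return False
    # When the 'rwpk' sample group is present, packing_type[i] shall be equal to 0 for every i
    if any(pt != 0 for pt in rwpk_packing_types):
        return False
    # version of the listed boxes, when present, shall be equal to 0
    if any(v != 0 for v in box_versions.values()):
        return False
    # SchemeInformationBox shall not directly or indirectly contain any other boxes
    allowed = {"ProjectedOmniVideoDynamicPackingBox", "ProjectionFormatBox",
               "StereoVideoBox", "RotationBox", "CoverageInformationBox"}
    return contained_box_names <= allowed

# Hypothetical example: monoscopic equirectangular track with no region-wise packing
print(erdp_constraints_ok(0, [], {"ProjectionFormatBox": 0},
                          {"ProjectedOmniVideoDynamicPackingBox", "ProjectionFormatBox"}))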

H.6.2 Projected omnidirectional video with dynamic region-wise packing box

H.6.2.1 Definition

Box Type: 'pord'
Container: SchemeInformationBox
Mandatory: Yes, when scheme_type is equal to 'podr'
Quantity: Zero or one

ProjectedOmniVideoDynamicPackingBox contains boxes indicating information for the following:

— the projection format of the projected picture (C for monoscopic video contained in the track, CL and CR for left and right view of stereoscopic video),

— dynamic region-wise packing, when applicable,

— the rotation for conversion between the local coordinate axes and the global coordinate axes, if applied, and

— optionally the content coverage of the track.

The values of the variables HorDiv1 and VerDiv1 are set as follows.

— If StereoVideoBox is not present in SchemeInformationBox, HorDiv1 is set equal to 1 and VerDiv1 is set equal to 1.

— Otherwise (StereoVideoBox is present in SchemeInformationBox), the following applies:

— If side-by-side frame packing is indicated, HorDiv1 is set equal to 2 and VerDiv1 is set equal to 1.

— Otherwise, if top-bottom frame packing is indicated, HorDiv1 is set equal to 1 and VerDiv1 is set equal to 2.

— Otherwise (temporal interleaving is indicated), HorDiv1 and VerDiv1 are both set equal to 1.

If RotationBox is not present in ProjectedOmniVideoDynamicPackingBox, RotationFlag is set equal to 0. Otherwise, RotationFlag is set equal to 1.

If StereoVideoBox is not present in SchemeInformationBox, SpatiallyPackedStereoFlag, TopBottomFlag, and SideBySideFlag are set equal to 0. Otherwise, the following applies.

— When the StereoVideoBox indicates top-bottom frame packing, SpatiallyPackedStereoFlag is set equal to 1, TopBottomFlag is set equal to 1, and SideBySideFlag is set equal to 0.

— When the StereoVideoBox indicates side-by-side frame packing, SpatiallyPackedStereoFlag is set equal to 1, TopBottomFlag is set equal to 0, and SideBySideFlag is set equal to 1.

— When the StereoVideoBox indicates temporal interleaving, SpatiallyPackedStereoFlag, TopBottomFlag, and SideBySideFlag are all set equal to 0.

The following applies:

— The width and height of a monoscopic projected luma picture (ConstituentPicWidth and ConstituentPicHeight, respectively) are derived as follows:

— If 'rwpk' sample group is not present, ConstituentPicWidth and ConstituentPicHeight are set to be equal to width / HorDiv1 and height / VerDiv1, respectively, where width and height are syntax elements of VisualSampleEntry.

— Otherwise, ConstituentPicWidth and ConstituentPicHeight are set equal to proj_picture_width / HorDiv1 and proj_picture_height / VerDiv1, respectively.

— If 'rwpk' sample group is not present, RegionWisePackingFlag is set equal to 0. Otherwise, RegionWisePackingFlag is set equal to 1.

— The semantics of the sample locations of each decoded picture resulting from decoding the samples referring to this sample entry are specified in subclause 7.5.1.2.
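EXAMPLE The following non-normative sketch reproduces the derivation of HorDiv1, VerDiv1 and the constituent picture size described above, using simple flags in place of actual box parsing; the function name and its inputs are hypothetical.

# Non-normative sketch of the HorDiv1/VerDiv1 and constituent picture size derivation.
from typing import Optional, Tuple

def derive_constituent_size(
    stereo_packing: Optional[str],      # None, "side_by_side", "top_bottom" or "temporal"
    rwpk_present: bool,                 # presence of the 'rwpk' sample group
    width: int, height: int,            # VisualSampleEntry width/height
    proj_picture_width: int = 0,
    proj_picture_height: int = 0,
) -> Tuple[int, int]:
    if stereo_packing == "side_by_side":
        hor_div1, ver_div1 = 2, 1
    elif stereo_packing == "top_bottom":
        hor_div1, ver_div1 = 1, 2
    else:                               # no StereoVideoBox, or temporal interleaving
        hor_div1, ver_div1 = 1, 1

    if not rwpk_present:                # RegionWisePackingFlag equal to 0
        return width // hor_div1, height // ver_div1
    return proj_picture_width // hor_div1, proj_picture_height // ver_div1

# Hypothetical example: side-by-side stereo, no 'rwpk' sample group, 3840x1920 sample entry
print(derive_constituent_size("side_by_side", False, 3840, 1920))  # (1920, 1920)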

H.6.2.2 Syntax

aligned(8) class ProjectedOmniVideoDynamicPackingBox extends Box('pord') {
    ProjectionFormatBox() projection_format_box; // mandatory
    // optional boxes, but no fields
}

 
