ISO/IEC 14496-12:XXXX/DAM 2:2026(en)
ISO/IEC JTC1/SC 29
Secretariat: JISC
Date: 2025-11-17
Information technology — Coding of audio-visual objects — Part 12: ISO base media file format — Amendment 2: Support for carriage of depth and alpha
© ISO/IEC 2026
All rights reserved. Unless otherwise specified, or required in the context of its implementation, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO’s member body in the country of the requester.
ISO copyright office
CP 401 • Ch. de Blandonnet 8
CH-1214 Vernier, Geneva
Phone: +41 22 749 01 11
Email: copyright@iso.org
Website: www.iso.org
Published in Switzerland
Contents
1 Clause 2, Normative references
2 Clause 3.1, Terms and definitions
3.1 Clause 8.7 Track data layout structures
4.1 Clause 8.15 Entity Grouping
5 Clause 12, Media-specific definitions
5.2 Clause 12.1.5 Colour information
5.3 Clause 12.1.10 Pixel Information
5.4 Clause 12.1.11 Alpha Information
5.5 Clause 12.3.3 Sample entry
12.11.2.1 Depth media handler
12.11.2.2 Alpha media handler
12.11.3.1 Depth media header
12.11.3.2 Alpha media header
12.11.4.1 Depth media sample entry
12.11.4.2 Alpha media sample entry
12.11.5 Map Information box
12.11.5.1 Depth information box
12.11.5.2 Alpha information box
12.11.6 Auxiliary Information box
12.11.6.1 Invalid depth band box
6 Annex B Guidance on deriving from this document
7 Annex L Depreciation of boxes and identifiers
Annex L (normative) Depreciation of boxes and identifiers
8 Annex M Handling of depth and alpha maps
Annex M (informative) Handling of depth and alpha maps
M.2 Handling of parameters in SEI messages and container level
M.3 Lossless conversion between SEI-based and IEEE 754 32-bit floating-point formats
M.3.1 SEI-based floating point format
M.3.2 IEEE 754 32-bit floating-point format
M.3.3 Lossless conversion rules
M.4 Interpretation of depth decoded sample values
M.5 Example depth information sample
Foreword
ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.
The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).
ISO draws attention to the possibility that the implementation of this document may involve the use of (a) patent(s). ISO takes no position concerning the evidence, validity or applicability of any claimed patent rights in respect thereof. As of the date of publication of this document, ISO [had/had not] received notice of (a) patent(s) which may be required to implement this document. However, implementers are cautioned that this may not represent the latest information, which may be obtained from the patent database available at www.iso.org/patents. ISO shall not be held responsible for identifying any or all such patent rights.
Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.
For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see www.iso.org/iso/foreword.html.
This document was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, SC 29, Coding of audio, picture, multimedia and hypermedia information.
A list of all parts in the ISO/IEC 14496 series can be found on the ISO website.
Any feedback or questions on this document should be directed to the user’s national standards body. A complete listing of these bodies can be found at www.iso.org/members.html.
Information technology — Coding of audio-visual objects — Part 12: ISO base media file format — Amendment 2: Support for carriage of depth and alpha
1 Clause 2, Normative references
Add the following reference to clause 2:
ISO/IEC 23001‑17, Information technology — MPEG systems technologies — Part 17: Carriage of uncompressed video and images in ISO base media file format
2 Clause 3.1, Terms and definitions
Add the following definitions to clause 3.1:
3.70
alpha data
a collection of alpha values (3.73)
3.71
alpha elementary stream
an elementary stream (3.79) containing access units for coded alpha data (3.70)
3.72
alpha image
an array of pixels representation of alpha data (3.70), with each pixel associated with an alpha value (3.73)
3.73
alpha value
transparency value determined relative to a minimum and a maximum value, with the minimum value indicating full transparency and the maximum value indicating full opacity
3.74
depth data
a collection of depth values (3.78)
3.75
depth data range
a range of depth values (3.78) defined by a minimum and a maximum depth value (3.78)
3.76
depth elementary stream
an elementary stream (3.79) containing access units for coded depth data (3.74)
3.77
depth image
an array of pixels representation of depth data (3.74), with each pixel associated with a depth value (3.78)
3.78
depth value
distance of a given point in 3D space relative to an origin, as measured along a certain direction
3.79
elementary stream
a consecutive flow of mono-media data from a single source entity to a single destination entity on the compression layer
3.80
image representation
a 2D array of samples, with each sample associated with an intensity value
3 Clause 8, Box structures
3.1 Clause 8.7 Track data layout structures
Add the following new clause 8.7.10 to the track data layout structures (clause 8.7):
8.7.10 Compact Direct Sample References
8.7.10.1 Compact Direct Sample References Box
8.7.10.1.1 Definition
Box Type: | 'cdrf' |
Container: | SampleTableBox or TrackFragmentBox |
Mandatory: | No |
Quantity: | Zero or one (per container) |
CompactDirectSampleReferencesBox provides explicit coding dependencies of samples towards other samples. It associates with each sample of a track or track fragment:
— a sample identifier (sample ID) coded as an absolute sample ID or as a difference with the sample ID of a previous sample;
— a dependency list of sample references coded as a difference with the sample ID or as the identifier of another sample.
The listed dependencies should only contain the direct dependencies, i.e. if sample A depends on sample B which in turn depends on sample C, only sample B should be listed as a dependency to sample A.
CompactDirectSampleReferencesBox is structured as a list of entries for reconstructing the list of samples and their sample references, each entry defining either the sample identifier and the dependency list associated with a sample or a reference to a pattern of previously coded sample IDs and sample references in the reconstructed list of samples and their sample references.
The sample ID may be any identifier. It does not need to be unique, and the IDs used by the list of sample references refer to the previous sample defined with the given ID in the CompactDirectSampleReferencesBox. A sample ID used in a dependency list but not present in the track or in past track fragments indicates a broken dependency, i.e. that the sample cannot be decoded.
NOTE For video tracks, the sample ID could be the Picture Order Count (POC) of a sample. Broken dependencies typically happen when tuning in to a stream at a stream access point sample of type 3 (SAP 3 sample), e.g. in an open-GOP case.
The no_diff_mode variable indicates if the dependencies and sample IDs are coded as direct values or as differential values.
The reconstruction process of the list of samples and their sample references shall produce the same result as the following model:
— initializing an empty list flat_refs;
— for each entry in the box:
    — if is_ref is 0, appending to the flat_refs list an entry containing {nb_refs, is_abs, sample_ID_code, ref_IDs};
    — otherwise (is_ref is 1), for each K ranging from 0 to num_samples – 1 of the entry:
        — if no_diff_mode is 0, appending to flat_refs the entry flat_refs[offset + K%pattern_length];
        — otherwise, appending to flat_refs a copy of flat_refs[offset + K%pattern_length] with sample_ID_code equal to sample_IDs[K];
— validating that the number of entries in flat_refs is the same as the number of samples in the track or track fragment;
— for each sample J in the track or track fragment:
    — assigning sampleID as follows:
        — if flat_refs[J].is_abs is 1, setting sampleID to flat_refs[J].sample_ID_code;
        — otherwise, setting sampleID to flat_refs[J].sample_ID_code + sample[J-1].sampleID;
    — assigning sample reference IDs for each K in the range [0, flat_refs[J].nb_refs – 1] by:
        — if no_diff_mode is 0, subtracting the value flat_refs[J].ref_IDs[K] from sample[J].sampleID, i.e. referenceSampleID = sample[J].sampleID - flat_refs[J].ref_IDs[K];
        — otherwise, using the value flat_refs[J].ref_IDs[K].
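EXAMPLE The reconstruction model can be sketched in Python as follows. This is an illustrative sketch only, not a reference implementation. The class names Explicit and PatternRef, the function name reconstruct and the tuple-based output are assumptions made for the example; the field names follow the syntax of the box. The sketch assumes that is_abs is treated as 1 when no_diff_mode is 1 and that the reference loop runs over exactly nb_refs values.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Explicit:
    """Hypothetical container for an entry with is_ref == 0."""
    sample_id_code: int
    ref_ids: List[int] = field(default_factory=list)
    is_abs: int = 1


@dataclass
class PatternRef:
    """Hypothetical container for an entry with is_ref == 1."""
    offset: int
    pattern_length: int
    num_samples: int
    sample_ids: Optional[List[int]] = None  # only used when no_diff_mode == 1


def reconstruct(entries, no_diff_mode: int, sample_count: int) -> List[Tuple[int, List[int]]]:
    """Return one (sampleID, [referenceSampleID, ...]) tuple per sample."""
    flat_refs: List[Explicit] = []
    for entry in entries:
        if isinstance(entry, Explicit):
            flat_refs.append(entry)
            continue
        for k in range(entry.num_samples):
            src = flat_refs[entry.offset + k % entry.pattern_length]
            if no_diff_mode:
                # Copy the pattern entry, overriding its sample ID.
                src = Explicit(entry.sample_ids[k], list(src.ref_ids), 1)
            flat_refs.append(src)

    if len(flat_refs) != sample_count:
        raise ValueError("entry count does not match the number of samples")

    samples: List[Tuple[int, List[int]]] = []
    prev_id: Optional[int] = None
    for item in flat_refs:
        if no_diff_mode or item.is_abs:
            sample_id = item.sample_id_code
        else:
            if prev_id is None:
                raise ValueError("the first sample shall use is_abs equal to 1")
            sample_id = item.sample_id_code + prev_id
        if no_diff_mode:
            refs = list(item.ref_ids)
        else:
            refs = [sample_id - r for r in item.ref_ids]
        samples.append((sample_id, refs))
        prev_id = sample_id
    return samples
```

With differential coding, a chain in which each sample depends on the one before it can be described by two explicit entries followed by one pattern entry that repeats the second entry.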
8.7.10.1.2 Syntax
aligned(8) class CompactDirectSampleReferencesBox extends Box('cdrf')
{
    unsigned int(8) flags;
    if (flags & 2) bits = 32;
    else if (flags & 1) bits = 16;
    else bits = 8;
    if (flags & 8) entry_bits = 32;
    else if (flags & 4) entry_bits = 16;
    else entry_bits = 8;
    unsigned int no_diff_mode = (flags & 16) ? 1 : 0;
    unsigned int(entry_bits) nb_entries;
    for (i = 0; i < nb_entries; i++) {
        bit(1) is_ref;
        if (is_ref) {
            unsigned int(bits-1) offset;
            unsigned int(bits) pattern_length;
            unsigned int(bits) num_samples;
            if (no_diff_mode) {
                signed int(bits) sample_IDs[num_samples];
            }
        } else {
            unsigned int(bits-1) nb_refs;
            if (no_diff_mode) {
                signed int(bits) sample_ID_code;
            } else {
                bit(1) is_abs;
                signed int(bits-1) sample_ID_code;
            }
            signed int(bits) ref_IDs[nb_refs];
        }
    }
}
8.7.10.1.3 Semantics
nb_entries indicates the number of entries in the loop for reconstructing the list of references.
no_diff_mode indicates, when set to 1, that the sample references in ref_IDs[] and the sample IDs are coded as direct values, and when set to 0, that the sample references in ref_IDs[] are coded as differential values and sample IDs may be coded as differential or direct values depending on is_abs.
is_ref indicates, if set to 1, that the entry is a reference to a pattern of previously coded sample references in the reconstructed list of references. Otherwise, if set to 0, an explicit list of sample references follows. The first sample in a track or track fragment shall have an associated is_ref value of 0.
offset indicates the start of the pattern in the reconstructed list of references, value 0 designating the list of sample references of the first sample in the track or track fragment.
pattern_length indicates the number of samples in the pattern starting from offset.
num_samples indicates the number of samples described. If this value is greater than pattern_length, the pattern is looped over until all samples indicated by the number of samples num_samples are described. It is not necessarily the case that num_samples is a multiple of pattern_length; the last repeated pattern may be truncated.
nb_refs indicates the number of direct sample references for this sample. If 0, the sample has no direct references (i.e. the sample is a sync sample).
NOTE The number of direct sample references nb_refs can represent a number of sample references possibly used for processing the sample, and does not necessarily represent a number of sample references actually used in a codec reference list for processing the sample.
is_abs indicates, if set to 1, that sample_ID_code is the value of the sample identifier. If set to 0, it indicates that sample_ID_code is the difference between the sample identifier and the preceding sample identifier in the reconstructed list of references. The first sample in a track or track fragment shall have an associated is_abs value of 1. When no_diff_mode is 1, the is_abs field is not coded and takes the value 1.
sample_ID_code indicates the value of the sample identifier. If no_diff_mode is 1, it indicates the direct value of the sample identifier. If no_diff_mode is 0, it indicates the difference or direct value of the sample identifier coded as specified by is_abs flag.
ref_IDs is an array that indicates the sample identifiers of the direct sample references if any. If no_diff_mode is 0, the identifier is coded as a difference between the identifier of the sample being described by this entry and the identifier of the reference sample, i.e. sampleID - referenceSampleID. Otherwise (no_diff_mode is 1), the identifier is coded as the value of the sample identifier of the reference sample.
sample_IDs is an array that indicates the list of values of the sample identifiers coded in a pattern.
When present in a SampleTableBox (respectively a TrackFragmentBox), the number of samples described by CompactDirectSampleReferencesBox shall be equal to the number of samples present in the track (respectively in the track fragment).
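EXAMPLE The following Python sketch reads the payload of a CompactDirectSampleReferencesBox (the bytes after the box header) into a list of dictionaries. It is illustrative only. The function names parse_cdrf, _read and _to_signed and the dictionary layout are assumptions made for the example. The sketch assumes big-endian, most-significant-bit-first packing, so that is_ref and is_abs occupy the top bit of the word that also carries offset, nb_refs or sample_ID_code, and it assumes that an explicit entry follows whenever is_ref is 0.

```python
def _read(buf: bytes, pos: int, nbits: int, signed: bool = False):
    """Read one big-endian integer of nbits (a multiple of 8) at byte offset pos."""
    n = nbits // 8
    if pos + n > len(buf):
        raise ValueError("truncated cdrf payload")
    return int.from_bytes(buf[pos:pos + n], "big", signed=signed), pos + n


def _to_signed(value: int, nbits: int) -> int:
    """Interpret the low nbits of value as a two's complement number."""
    sign = 1 << (nbits - 1)
    return (value ^ sign) - sign


def parse_cdrf(payload: bytes):
    """Return (no_diff_mode, entries) for a 'cdrf' box payload."""
    if not payload:
        raise ValueError("empty cdrf payload")
    flags = payload[0]
    bits = 32 if flags & 2 else 16 if flags & 1 else 8
    entry_bits = 32 if flags & 8 else 16 if flags & 4 else 8
    no_diff_mode = 1 if flags & 16 else 0
    low_mask = (1 << (bits - 1)) - 1

    nb_entries, pos = _read(payload, 1, entry_bits)
    entries = []
    for _ in range(nb_entries):
        head, pos = _read(payload, pos, bits)
        is_ref = head >> (bits - 1)
        if is_ref:
            entry = {"is_ref": 1, "offset": head & low_mask}
            entry["pattern_length"], pos = _read(payload, pos, bits)
            entry["num_samples"], pos = _read(payload, pos, bits)
            if no_diff_mode:
                ids = []
                for _ in range(entry["num_samples"]):
                    value, pos = _read(payload, pos, bits, signed=True)
                    ids.append(value)
                entry["sample_IDs"] = ids
        else:
            entry = {"is_ref": 0, "nb_refs": head & low_mask}
            if no_diff_mode:
                entry["is_abs"] = 1  # not coded; inferred to be 1
                entry["sample_ID_code"], pos = _read(payload, pos, bits, signed=True)
            else:
                word, pos = _read(payload, pos, bits)
                entry["is_abs"] = word >> (bits - 1)
                entry["sample_ID_code"] = _to_signed(word & low_mask, bits - 1)
            refs = []
            for _ in range(entry["nb_refs"]):
                value, pos = _read(payload, pos, bits, signed=True)
                refs.append(value)
            entry["ref_IDs"] = refs
        entries.append(entry)
    return no_diff_mode, entries
```

The sketch does not check the constraints on the first sample or on the total number of samples; a real reader would validate those after reconstruction.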
4 Clause 8, Box structures
4.1 Clause 8.15 Entity Grouping
Add the following at the end of 8.15.3.1:
'prse': The entities mapped to this group belong to the same presentation/session as specified in 8.15.5.1.
Add the following subclause 8.15.5.1:
8.15.5.1 Presentation group box
8.15.5.1.1 Definition
Box Type: | 'prse' |
Container: | GroupsListBox in a MetaBox at movie level for tracks; GroupsListBox in a MetaBox at file level for items |
Mandatory: | No |
Quantity: | exactly one |
The PresentationGroupBox carries unique identifiers for the entities in the file.
The presentation_ID present in the PresentationGroupBox can be used for determining that different files are part of the same presentation/session. If an item or a track belongs to two or more different presentations/sessions, multiple presentation_IDs are present in the PresentationGroupBox, which can then be mapped to different presentations/sessions.
8.15.5.1.2 Syntax
aligned(8) class PresentationGroupBox
    extends EntityToGroupBox('prse', version=0, flags)
{
    for (i=0; i<num_entities_in_group; i++) {
        unsigned int(8) presentation_ids_count_minus1;
        unsigned int(128) presentation_ID[presentation_ids_count_minus1 + 1];
    }
}
8.15.5.1.3 Semantics
The following values are defined for the flags field when entities associated by PresentationGroupBox are tracks:
TRACK_MERGE_PROCESS (flag mask is 0x000001): if this flag is set in the files having the same presentation_ID, the following process is applied:
— if the entity IDs of the tracks overlap across the files, the samples that do not overlap in decoding time, together with the respective track metadata, shall be selected from any of these tracks. The samples and metadata shall be selected such that there is a sync sample wherever a switch to another of these tracks takes place;
— if the entity IDs of the tracks differ across the files, the tracks can be combined into a single file.
Otherwise (when TRACK_MERGE_PROCESS is not set), no merging process is specified.
NOTE When tracks in different files do not carry a presentation_ID, there is no guarantee that combining those files leads to a conforming ISOBMFF file.
presentation_ids_count_minus1 plus 1 indicates the number of unique IDs associated with an entity.
presentation_ID is an array of unique values. Array elements of presentation_ID can take integer values (UUID) as defined in RFC 4122.
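EXAMPLE The following Python sketch reads the part of a PresentationGroupBox that follows the EntityToGroupBox fields and checks whether two entities share a presentation. It is illustrative only. The function names parse_presentation_ids and same_presentation are assumptions made for the example, and the caller is assumed to have already parsed num_entities_in_group from the EntityToGroupBox. Each 128-bit presentation_ID is assumed to be stored as 16 bytes in network byte order.

```python
import uuid
from typing import List


def parse_presentation_ids(tail: bytes, num_entities_in_group: int) -> List[List[uuid.UUID]]:
    """Return, for each entity in the group, its list of presentation IDs."""
    pos = 0
    per_entity = []
    for _ in range(num_entities_in_group):
        if pos >= len(tail):
            raise ValueError("truncated prse payload")
        count = tail[pos] + 1  # presentation_ids_count_minus1 + 1
        pos += 1
        ids = []
        for _ in range(count):
            # uuid.UUID raises ValueError unless exactly 16 bytes are supplied.
            ids.append(uuid.UUID(bytes=tail[pos:pos + 16]))
            pos += 16
        per_entity.append(ids)
    return per_entity


def same_presentation(ids_a: List[uuid.UUID], ids_b: List[uuid.UUID]) -> bool:
    """True if the two entities carry at least one common presentation ID."""
    return bool(set(ids_a) & set(ids_b))
```

A reader comparing entities from different files could call same_presentation on the lists parsed from each file to decide whether the files belong to the same presentation/session.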
5 Clause 12, Media-specific definitions
5.1 Clause 12.1.3.2 Syntax
Replace clause 12.1.3.2 with:
class VisualSampleEntry(codingname) extends SampleEntry (codingname)
{
    unsigned int(16) pre_defined = 0;
    const unsigned int(16) reserved = 0;
    unsigned int(32) pre_defined[3] = 0;
    unsigned int(16) width;
    unsigned int(16) height;
    template unsigned int(32) horizresolution = 0x00480000; // 72 dpi
    template unsigned int(32) vertresolution = 0x00480000; // 72 dpi
    const unsigned int(32) reserved = 0;
    template unsigned int(16) frame_count = 1;
    bit(8) compressorname[32];
    template unsigned int(16) depth = 0x0018;
    int(16) pre_defined = -1;
    // other boxes from derived specifications
    CleanApertureBox clap; // optional
    PixelAspectRatioBox pasp; // optional
    PixelInformationBox pixi; // optional
    AlphaInformationBox alpi; // optional
}
5.2 Clause 12.1.5 Colour information
Replace clause 12.1.5.2 with:
class ColourInformationBox extends Box('colr')
{
    unsigned int(32) colour_type;
    if (colour_type == 'nclx') /* on-screen colours */
    {
        unsigned int(16) colour_primaries;
        unsigned int(16) transfer_characteristics;
        unsigned int(16) matrix_coefficients;
        unsigned int(1) full_range_flag;
        unsigned int(7) reserved = 0;
    }
    else if (colour_type == 'rICC')
    {
        ICC_profile; // restricted ICC profile
    }
    else if (colour_type == 'prof')
    {
        ICC_profile; // unrestricted ICC profile
    }
    else if (colour_type == 'bICC')
    {
        ICC_profile; // Brotli compressed unrestricted ICC profile
    }
}
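EXAMPLE The following Python sketch reads the payload of a ColourInformationBox (the bytes after the box header). It is illustrative only. The function name parse_colr and the dictionary keys are assumptions made for the example. For the ICC-based colour types the sketch returns the profile bytes untouched; in particular, a 'bICC' profile is left Brotli-compressed, because decompressing it would need a Brotli decoder that is not part of the Python standard library.

```python
import struct


def parse_colr(payload: bytes) -> dict:
    """Parse a 'colr' box payload into a dictionary."""
    if len(payload) < 4:
        raise ValueError("truncated colr payload")
    colour_type = payload[:4].decode("ascii")
    if colour_type == "nclx":
        if len(payload) < 11:
            raise ValueError("truncated nclx colour information")
        primaries, transfer, matrix, last = struct.unpack(">HHHB", payload[4:11])
        return {
            "colour_type": colour_type,
            "colour_primaries": primaries,
            "transfer_characteristics": transfer,
            "matrix_coefficients": matrix,
            "full_range_flag": last >> 7,  # top bit; the low 7 bits are reserved
        }
    if colour_type in ("rICC", "prof", "bICC"):
        # Profile bytes are returned as stored ('bICC' stays compressed).
        return {"colour_type": colour_type, "icc_data": payload[4:]}
    raise ValueError("unsupported colour_type: " + colour_type)
```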
5.3 Clause 12.1.10 Pixel Information
Add the following new clause 12.1.10:
12.1.10 Pixel information
12.1.10.1 Definition
Box type: | 'pixi' |
Container: | VisualSampleEntry or ItemPropertyContainerBox |
Mandatory: | No |
Quantity: | At most one |
The PixelInformationBox, if present, indicates the number and bit depth of colour and alpha/depth components of the decoded samples. If (px_flags & 1) != 0, the PixelInformationBox also indicates content, component format and subsampling information per channel.
12.1.10.2 Syntax
aligned(8) class PixelInformationBox extends FullBox('pixi', version = 0, px_flags) {
    unsigned int(8) num_channels;
    for (i=0; i<num_channels; i++) {
        unsigned int(8) bits_per_channel;
    }
    if ((px_flags & 1) != 0) {
        for (i=0; i<num_channels; i++) {
            unsigned int(3) channel_idc;
            unsigned int(1) reserved = 0;
            unsigned int(2) component_format;
            unsigned int(1) subsampling_flag;
            unsigned int(1) channel_label_flag;
            if (subsampling_flag) {
                unsigned int(4) subsampling_type;
                unsigned int(4) subsampling_location;
            }
            if (channel_label_flag) {
                utf8string channel_label;
            }
        }
    }
}
12.1.10.3 Semantics
px_flags & 1 if equal to 1, indicates that the channel_idc, component_format, subsampling_flag, and channel_label_flag fields are present. If equal to 0, indicates that the channel_idc, component_format, subsampling_flag and channel_label_flag fields are not present.
num_channels indicates the number of channels for each pixel of the decoded samples.
bits_per_channel indicates the bits per channel for the pixels of the decoded samples. The value of this field shall not be 0.
channel_idc indicates the contents of the channel as specified in Table 16. At most one channel shall have a channel_idc of 5.
Table 16 — channel_idc values and their meaning.
Value of channel_idc | Mapping (depending on the 'colr' box) |
0 | Unused |
1 | Unspecified |
2 | First colour channel (e.g. monochrome, Y, R, C) |
3 | Second colour channel (e.g. U, Cb, G, M) |
4 | Third colour channel (e.g. V, Cr, B, Y) |
5 | Alpha |
6 | Depth |
7 | Fourth colour channel (e.g. K) |
component_format indicates the data type of the channel as defined by the component_format values in ISO/IEC 23001‑17 where component_bit_depth is considered to be equal to bits_per_channel.
subsampling_flag if equal to 1, indicates that the subsampling_type and subsampling_location fields are present. If equal to 0, indicates that the subsampling_type and subsampling_location fields are not present.
channel_label_flag if equal to 1, indicates the presence of the channel_label field. If equal to 0, indicates the channel_label field is not present.
subsampling_type indicates the subsampling type as specified by Table XX2.
subsampling_location indicates the subsampling sample location as specified by Table XX2.
channel_label is the human readable description of the channel.
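EXAMPLE The following Python sketch reads the payload of a PixelInformationBox (the bytes after the FullBox header). It is illustrative only. The function name parse_pixi and the dictionary keys are assumptions made for the example. The sketch assumes most-significant-bit-first packing of the per-channel bit fields and a null-terminated UTF-8 encoding for utf8string.

```python
from typing import Dict, List


def parse_pixi(px_flags: int, payload: bytes) -> List[Dict]:
    """Return one dictionary per channel described by a 'pixi' payload."""
    if not payload:
        raise ValueError("empty pixi payload")
    num_channels = payload[0]
    if len(payload) < 1 + num_channels:
        raise ValueError("truncated pixi payload")
    channels = [{"bits_per_channel": b} for b in payload[1:1 + num_channels]]
    pos = 1 + num_channels

    if px_flags & 1:
        for ch in channels:
            b = payload[pos]
            pos += 1
            ch["channel_idc"] = b >> 5              # bits 7..5
            # bit 4 is reserved
            ch["component_format"] = (b >> 2) & 0x3  # bits 3..2
            subsampling_flag = (b >> 1) & 1          # bit 1
            channel_label_flag = b & 1               # bit 0
            if subsampling_flag:
                s = payload[pos]
                pos += 1
                ch["subsampling_type"] = s >> 4
                ch["subsampling_location"] = s & 0xF
            if channel_label_flag:
                end = payload.index(0, pos)          # null terminator
                ch["channel_label"] = payload[pos:end].decode("utf-8")
                pos = end + 1
    return channels
```

For an 8-bit RGBA image, the extended form would carry channel_idc values 2, 3, 4 and 5 in that order, with an optional label on any channel.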
5.4 Clause 12.1.11 Alpha Information
Add the following new clause 12.1.11:
12.1.11 Alpha information
12.1.11.1 Definition
Box type: | 'alpi' |
Container: | VisualSampleEntry or ItemPropertyContainerBox |
Mandatory: | No |
Quantity: | At most one |
The AlphaInformationBox may be used to provide information independent of the coding, to interpret the alpha data. The AlphaInformationBox is optional. Default values are assumed if the box is not present, where the values depend on the type of content.
For integer content it is assumed that a value of 0 indicates full transparency and a value equal to 2^bit_depth - 1 indicates full opacity, where bit_depth is the number of bits used to represent a data point from the original component(s) of an image (e.g. in the case of monochrome image, the bit depth of the luma samples).
For floating-point content it is assumed that a value of 0.0 indicates full transparency and a value of 1.0 indicates full opacity.
12.1.11.2 Syntax
aligned(8) class AlphaInformationBox extends FullBox('alpi', version = 0, flags) {
    computed bit is_float = (flags & 4) != 0;
    if (is_float) {
        float(32) opaque_value;
        float(32) transparent_value;
    }
    else {
        unsigned int(16) opaque_value;
        unsigned int(16) transparent_value;
    }
}
12.1.11.3 Semantics
version is an integer that specifies the version of this box.
flags is a 24-bit integer with flags; the following values are defined:
premultiplication_mode: Flag mask is 0x000003. Specifies whether the pixel values of the associated colour or grayscale channel(s) are premultiplied by the alpha values, as follows:
0: the pixel values of the associated colour or grayscale channel(s) are not premultiplied by the alpha values.
1: the pixel values of the associated colour or grayscale channel(s) have been premultiplied by the alpha values in non-linear RGB signal space, obtained by converting the pixel values to RGB according to a MatrixCoefficients value and VideoFullRangeFlag value as defined in ISO/IEC 23091-2.
2: the pixel values of the associated colour or grayscale channel(s) have been premultiplied by the alpha values in linear RGB display light space, obtained by converting the pixel values to RGB according to a MatrixCoefficients value and VideoFullRangeFlag value as defined in ISO/IEC 23091-2, then applying an electro-optical transfer function according to the transfer function signaled in an ICC profile if present, or to a TransferCharacteristics value as defined in ISO/IEC 23091-2 otherwise.
3: reserved.
is_float: Flag mask is 0x000004. If set, specifies that the opaque and transparent values are floating-point rather than integers.
opaque_value: specifies the alpha value for which the associated colour or grayscale channel(s) values are considered opaque for the purposes of alpha blending.
transparent_value: specifies the alpha value for which the associated colour or grayscale channel(s) values are considered transparent for the purposes of alpha blending.
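EXAMPLE The following Python sketch reads the payload of an AlphaInformationBox and converts a decoded alpha sample to an opacity between 0.0 and 1.0. It is illustrative only. The function names parse_alpi and alpha_to_opacity are assumptions made for the example. The linear interpolation between transparent_value and opaque_value, and the clamping of values outside that interval, are also assumptions; this clause only defines the two end points.

```python
import struct


def parse_alpi(flags: int, payload: bytes) -> dict:
    """Parse the FullBox flags and payload of an 'alpi' box."""
    is_float = (flags & 0x000004) != 0
    fmt = ">ff" if is_float else ">HH"  # big-endian float(32) or unsigned int(16) pair
    if len(payload) < struct.calcsize(fmt):
        raise ValueError("truncated alpi payload")
    opaque_value, transparent_value = struct.unpack_from(fmt, payload)
    return {
        "premultiplication_mode": flags & 0x000003,
        "is_float": is_float,
        "opaque_value": opaque_value,
        "transparent_value": transparent_value,
    }


def alpha_to_opacity(value, opaque_value, transparent_value) -> float:
    """Map a decoded alpha sample to [0.0, 1.0] (assumed linear, clamped)."""
    if opaque_value == transparent_value:
        raise ValueError("opaque_value and transparent_value must differ")
    t = (value - transparent_value) / (opaque_value - transparent_value)
    return min(1.0, max(0.0, t))
```

Because the mapping is expressed relative to the two signalled values, it also covers content in which the opaque value is numerically smaller than the transparent value.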
5.5 Clause 12.3.3 Sample entry
Replace clause 12.3.3 with:
12.3.3 Sample entry
12.3.3.1 Definition
Timed metadata tracks use MetaDataSampleEntry.
In case of XML metadata a BitRateBox in the SampleEntry can be used to choose the appropriate memory representation format (DOM, STX).
The URIMetaSampleEntry entry contains, in a box, the URI defining the form of the metadata, and optional initialization data. The format of both the samples and of the initialization data is defined by all or part of the URI form.
It may be the case that the URI identifies a format of metadata that allows there to be more than one ‘stated fact’ within each sample. However, all metadata samples in this format are effectively ‘I frames’, defining the entire set of metadata for the time interval they cover. This means that the complete set of metadata at any instant, for a given track, is contained in (a) the time-aligned samples of the track(s) (if any) describing that track, plus (b) the track metadata (if any), the movie metadata (if any) and the file metadata (if any).
If incrementally-changed metadata is needed, the MPEG-7 framework provides that capability.
Information on URI forms for some metadata systems can be found in Annex L.
Timed metadata tracks carrying dynamic depth information use DepthInformationSampleEntry.
There should be no DepthInformationBox in the trak box of a depth track associated with a timed metadata track carrying DepthInformationSampleEntry depth information. If there is a DepthInformationBox in the trak box of a depth track associated with a timed metadata track carrying DepthInformationSampleEntry depth information, the depth information in the DepthInformationBox takes precedence over the depth information in the DepthInformationSampleEntry and the associated timed metadata track shall be ignored.
12.3.3.2 Syntax
class MetaDataSampleEntry(codingname) extends SampleEntry (codingname)
{
}
class XMLMetaDataSampleEntry() extends MetaDataSampleEntry ('metx')
{
utf8string content_encoding; // optional
utf8list namespace;
utf8list schema_location; // optional
}
class TextConfigBox extends FullBox ('txtC', 0, 0)
{
utf8string text_config;
}
class TextMetaDataSampleEntry() extends MetaDataSampleEntry ('mett')
{
utf8string content_encoding; // optional
utf8string mime_format;
TextConfigBox (); // optional
}
class MIMEBox extends FullBox ('mime', 0, 0)
{
utf8string content_type;
}
aligned(8) class URIBox extends FullBox('uri ', version = 0, 0)
{
utf8string theURI;
}
aligned(8) class URIInitBox extends FullBox('uriI', version = 0, 0)
{
unsigned int(8) uri_initialization_data[];
}
class URIMetaSampleEntry() extends MetaDataSampleEntry ('urim')
{
URIBox the_label;
URIInitBox init; // optional
}
class DepthInformationSampleEntry() extends MetaDataSampleEntry ('dise') {
    if (is_float) {
        float(32) range_min;
        float(32) range_max;
    } else {
        unsigned int(6) units;
        unsigned int(2) reserved;
    }
}
12.3.3.3 Semantics
content_encoding provides a MIME type which identifies the content encoding of the timed metadata. It is defined in the same way as for an ItemInfoEntry in this document. If not present (an empty string is supplied) the timed metadata is not encoded. An example for this field is ‘application/zip’. Note that no MIME types for BiM [ISO/IEC 23001-1] and TeM [ISO/IEC 15938-1] currently exist. Thus, the experimental MIME types ‘application/x-BiM’ and ‘text/x-TeM’ shall be used to identify these encoding mechanisms.
namespace provides one or more XML namespaces to which the sample documents conform. When used for metadata, this is needed for identifying its type, e.g. gBSD or AQoS [MPEG-21-7] and for decoding using XML aware encoding mechanisms such as BiM.
schema_location provides zero or more URLs for XML schema(s) to which the sample document conforms. If there is one namespace and one schema, then this field shall be the URL of the one schema. If there is more than one namespace, then the syntax of this field shall adhere to that for xsi:schemaLocation attribute as defined by XML. When used for metadata, this is needed for decoding of the timed metadata by XML aware encoding mechanisms such as BiM.
mime_format provides a MIME type which identifies the content format of the samples. Examples for this field include ‘text/html’ and ‘text/plain’.
text_config provides the initial text of each document which is prepended before the contents of each sync sample.
content_type is a string corresponding to the MIME type each XML document carried in the stream has when delivered on its own, possibly including sub-parameters.
NOTE This implies that if two XML documents carried in the same track have different MIME types (or sub-parameters), each document is associated with a different sample entry.
theURI is a URI formatted according to the rules in 7.3.3.
uri_initialization_data is opaque data whose form is defined in the documentation of the URI form.
The semantics of the DepthInformationSampleEntry parameters are the same as the semantics of the corresponding parameters in the DepthInformationBox as specified in clause 12.11.5.1.3. An example of a sample for the DepthInformationSampleEntry is specified in Annex M.5.
5.6 Clause 12.11
Add the following new clause 12.11:
12.11 Auxiliary maps
12.11.1 General
Auxiliary maps are image representations of non-visual elementary streams. Currently ISOBMFF supports carriage of depth data and alpha data as auxiliary maps.
12.11.2 Media handler
12.11.2.1 Depth media handler
Depth video tracks use the 'depv' handler type in the HandlerBox of the MediaBox, as defined in 8.4.3.
A depth video track is coded the same as a video track, but uses this different handler type, and is not intended to be visually displayed. Depth video tracks may be linked to a video track using a reference of type 'cdsc'.
12.11.2.2 Alpha media handler
Alpha video tracks use the 'alpv' handler type in the HandlerBox of the MediaBox, as defined in 8.4.3.
An alpha video track is coded the same as a video track, but uses this different handler type, and is not intended to be visually displayed. Alpha video tracks are linked to a video track using a reference of type 'cdsc'.
12.11.3 Media header
12.11.3.1 Depth media header
Depth video tracks use the NullMediaHeaderBox in the MediaInformationBox as defined in 8.4.5.
12.11.3.2 Alpha media header
Alpha video tracks use the NullMediaHeaderBox in the MediaInformationBox as defined in 8.4.5.
12.11.4 Sample entry
12.11.4.1 Depth media sample entry
Depth video tracks use VisualSampleEntry.
12.11.4.2 Alpha media sample entry
Alpha video tracks use VisualSampleEntry.
12.11.5 Map Information box
12.11.5.1 Depth information box
12.11.5.1.1 Definition
Box Type: | 'depx' |
Container: | VisualSampleEntry |
Mandatory: | No |
Quantity: | One |
The DepthInformationBox may be used to provide information independent of the coding, to interpret the depth data.
12.11.5.1.2 Syntax
class DepthInformationBox extends FullBox ('depx', version = 0, flags)
{
    computed bit is_float = (flags & 1) != 0;
    if (is_float) {
        float(32) range_min;
        float(32) range_max;
    } else {
        unsigned int(16) range_min;
        unsigned int(16) range_max;
    }
    float(32) near_plane;
    float(32) far_plane;
    unsigned int(6) units;
    unsigned int(2) reserved;
}
12.11.5.1.3 Semantics
version is an integer that specifies the version of this box.
flags is a 24-bit integer with flags; the following values are defined:
is_float: Flag mask is 0x000001. If value is 0, range_min and range_max are integer numbers. If value is 1, range_min and range_max are floating-point numbers.
depth_mapping_type: Flag mask is 0x000002. The value indicates the type of mapping between decoded sample values and depth values, as specified in Table 15.
Table 15 — Definition of depth_mapping_type.
depth_mapping_type | Interpretation |
0 | When depth_mapping_type is equal to 0, a linear relationship is defined between the decoded sample values and the depth values, where the decoded sample value equal to range_min corresponds to near_plane, and the decoded sample value equal to range_max corresponds to far_plane. |
1 | When depth_mapping_type is equal to 1, an inverse relationship is defined between the decoded sample values and the depth values, where the decoded sample value equal to range_min corresponds to the inverse of near_plane, and the decoded sample value equal to range_max corresponds to the inverse of far_plane. |
near_plane and far_plane specify the nearest and the farthest depth values, respectively. The near_plane value can be smaller or greater than the far_plane value, but the two values shall not be equal. When the near_plane value is greater than the far_plane value, larger decoded sample values are assigned to smaller depth values. When depth_mapping_type is equal to 1, neither the near_plane value nor the far_plane value shall be equal to zero.
range_min and range_max specify the minimum and the maximum decoded sample values respectively, defining the range of decoded sample values that represent depth values. The range_max value shall be greater than the range_min value. If a decoded sample value is outside of the specified range, it shall be ignored.
NOTE The type float(32) of range_min and range_max can support the whole range of values for decoded sample value types of both float(32) and unsigned int(16).
units specifies the units of the depth values, as follows:
0: unspecified
1: the values are in metres
2: the values are in millimetres
3-63: reserved.
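The field layout above can be illustrated with a minimal parsing sketch. It assumes the FullBox header (size, type, version, flags) has already been read, that fields are big-endian, that bit fields are packed most significant bit first, and that the flag masks are those listed in the semantics (0x000001 for is_float, 0x000002 for depth_mapping_type). The function name and the returned dictionary are hypothetical and not part of this document.

```python
import struct


def parse_depth_information_payload(flags: int, payload: bytes) -> dict:
    """Parse the fields that follow the FullBox header of a 'depx' box (sketch)."""
    is_float = (flags & 0x000001) != 0            # mask taken from the semantics
    depth_mapping_type = (flags & 0x000002) >> 1  # 0 = linear, 1 = inverse

    # range_min/range_max are float(32) when is_float is set, else unsigned int(16).
    range_fmt = ">ff" if is_float else ">HH"
    range_min, range_max = struct.unpack_from(range_fmt, payload, 0)
    offset = struct.calcsize(range_fmt)

    # near_plane, far_plane, then one byte holding units (6 bits) + reserved (2 bits).
    near_plane, far_plane, last_byte = struct.unpack_from(">ffB", payload, offset)
    units = last_byte >> 2

    # Constraints stated in the semantics.
    if not range_max > range_min:
        raise ValueError("range_max shall be greater than range_min")
    if near_plane == far_plane:
        raise ValueError("near_plane and far_plane shall not be equal")
    if depth_mapping_type == 1 and (near_plane == 0 or far_plane == 0):
        raise ValueError("inverse mapping requires non-zero near_plane and far_plane")

    return {
        "is_float": is_float,
        "depth_mapping_type": depth_mapping_type,
        "range_min": range_min,
        "range_max": range_max,
        "near_plane": near_plane,
        "far_plane": far_plane,
        "units": units,
    }
```

Any bytes that follow the units byte (for example a contained box) are ignored by this sketch.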
12.11.5.2 Alpha information box
12.11.5.2.1 Definition
Box Type: | 'alpi' |
Container: | VisualSampleEntry |
Mandatory: | No |
Quantity: | One |
The AlphaInformationBox may be used to provide information independent of the coding, to interpret the alpha data.
When the AlphaInformationBox is absent, a value of 0 is assumed to indicate full transparency and a value equal to 2^bit_depth - 1 is assumed to indicate full opacity, where bit_depth is the number of bits used to represent a data point from the original component(s) of an image (e.g. in the case of a monochrome image, the bit depth of the luma samples).
12.11.5.2.2 Syntax
class AlphaInformationBox extends FullBox ('alpi', version = 0, flags){
unsigned int(16) opaque_value;
unsigned int(16) transparent_value;
}
12.11.5.2.3 Semantics
version is an integer that specifies the version of this box.
flags is a 24-bit integer with flags; the following values are defined:
premultiplication_mode: Flag mask is 0x000004. Specifies if the pixel values of the associated colour or grayscale channel(s) are premultiplied by the alpha values as follows:
0: the pixel values of the associated colour or grayscale channel(s) are not premultiplied by the alpha values.
1: the pixel values of the associated colour or grayscale channel(s) have been premultiplied by the alpha values in non-linear RGB signal space, obtained by converting the pixel values to RGB according to a MatrixCoefficients value and VideoFullRangeFlag value as defined in ISO/IEC 23091-2.
2: the pixel values of the associated colour or grayscale channel(s) have been premultiplied by the alpha values in linear RGB display light space, obtained by converting the pixel values to RGB according to a MatrixCoefficients value and VideoFullRangeFlag value as defined in ISO/IEC 23091-2, then applying an electro-optical transfer function according to the transfer function signaled in an ICC profile if present, or to a TransferCharacteristics value as defined in ISO/IEC 23091-2 otherwise.
3: reserved.
opaque_value specifies the alpha value for which the referenced video track values are considered opaque for the purposes of alpha blending.
transparent_value specifies the alpha value for which the referenced video track values are considered transparent for the purposes of alpha blending.
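As an illustration of how opaque_value and transparent_value could be used by a renderer, the following sketch maps a decoded alpha sample to a blending weight in the range 0 to 1. The linear interpolation between the two values and the clamping of out-of-range samples are assumptions of this sketch and are not specified by this document; the function name is hypothetical. When both values are omitted, the defaults for an absent AlphaInformationBox are used.

```python
def normalized_alpha(sample, bit_depth, opaque_value=None, transparent_value=None):
    """Map a decoded alpha sample to a weight in [0.0, 1.0] (sketch)."""
    if opaque_value is None or transparent_value is None:
        # AlphaInformationBox absent: 0 is fully transparent,
        # 2^bit_depth - 1 is fully opaque.
        transparent_value = 0
        opaque_value = (1 << bit_depth) - 1
    if opaque_value == transparent_value:
        raise ValueError("opaque_value and transparent_value must differ")
    # Assumed linear interpolation; also covers opaque_value < transparent_value.
    weight = (sample - transparent_value) / (opaque_value - transparent_value)
    # Assumed clamping of samples outside the two signalled values.
    return min(1.0, max(0.0, weight))
```

Because the interpolation is written in terms of the difference between the two values, an inverted convention (opaque_value smaller than transparent_value) needs no special case.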
12.11.6 Auxiliary Information box
12.11.6.1 Invalid depth band box
12.11.6.1.1 Definition
Box Type: | 'indb' |
Container: | DepthInformationBox |
Mandatory: | No |
Quantity: | One |
The InvalidDepthBandBox is used to provide information related to the presence of an invalid depth band in a depth image, independently of the coding.
12.11.6.1.2 Syntax
aligned(8) class InvalidDepthBandBox extends Box('indb') {
unsigned int(2) band_side;
unsigned int(14) band_length;
}
12.11.6.1.3 Semantics
band_side specifies the side of the depth image in which the invalid depth band is located. When the band_side value is equal to 0, the band is located at the left side of the depth image. Analogously, when the band_side value is equal to 1, 2, and 3, the band is located at the top, right and bottom side, respectively.
band_length specifies the length of the invalid depth band in the cropped depth image in pixel dimensions. When the band_side value is equal to 0 or 2, the band_length value indicates the width in pixel dimensions. When the band_side value is equal to 1 or 3, the band_length value indicates the height in pixel dimensions.
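The semantics above can be sketched as a small helper that unpacks the two fields and returns the rectangle covered by the invalid band. The sketch assumes big-endian, most-significant-bit-first packing (band_side in the upper 2 bits of the 16-bit word) and a top-left image origin; the function name and the (x, y, width, height) return convention are hypothetical.

```python
def invalid_depth_band_rect(payload: bytes, width: int, height: int):
    """Return (x, y, w, h) of the invalid band in a width x height depth image."""
    word = int.from_bytes(payload[:2], "big")
    band_side = (word >> 14) & 0x3   # 0 = left, 1 = top, 2 = right, 3 = bottom
    band_length = word & 0x3FFF      # width or height of the band, in pixels

    if band_side == 0:   # left: band_length is a width
        return (0, 0, band_length, height)
    if band_side == 1:   # top: band_length is a height
        return (0, 0, width, band_length)
    if band_side == 2:   # right: band_length is a width
        return (width - band_length, 0, band_length, height)
    return (0, height - band_length, width, band_length)  # bottom
```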
6 Annex B Guidance on deriving from this document
Add the following clause B.2.3 to Annex B “Guidance on deriving from this document”:
B.2.3 Deprecation of identifiers
Derived specifications ought to maintain a list of boxes and other identifiers that were specified in earlier versions but are now deprecated.
When documenting deprecated identifiers, the specification ought to indicate the version in which the deprecation occurred or reference the last version where the element was valid. Deprecated identifiers ought not to be reassigned or reused for new definitions in future versions of any specification. While these deprecated identifiers ought not to be used in new content, maintaining their documentation supports compatibility with legacy content and assists implementers working with historical specifications. This documentation may be maintained in an informative annex.
7 Annex L Deprecation of boxes and identifiers
Add the following new Annex L:
Annex L
(normative)
Deprecation of boxes and identifiers
L.1 Overview
L.2 Deprecated boxes
Table L.1 provides an overview of all boxes that were defined by previous editions of this specification but are now deprecated. These box identifiers shall not be reassigned for future definitions.
Table L.1 — Previously defined boxes
Four-CC | Description | Clause in last containing edition | Deprecated with edition-# |
mere | Metabox Relation box | 8.11.8 | 6 |
meco | Additional metadata container box | 8.11.7 | 6 |
imif | IPMPInfoBox | 13.4.4 | 5 or earlier |
ipmc | IPMP control box | 13.4.5 | 5 or earlier |
stsl | Sample scale box | 8.5.4 | 5 or earlier |
8 Annex M Handling of depth and alpha maps
Add the following new Annex M:
Annex M
(informative)
Handling of depth and alpha maps
M.1 General
ISOBMFF handles carriage of depth and alpha maps in a codec-agnostic way. This annex defines generic rules on how to use the encapsulated depth and alpha data.
M.2 Handling of parameters in SEI messages and container level
NOTE Some of the following rules are consequences of the file format constraints defined for the carriage of depth and alpha tracks (see Clause 12.11).
It is recommended:
1. If a Depth representation information SEI message, as specified in AVC [37], HEVC [38] or VVC/VSEI [36], with depth_representation_type equal to 1 or 3 is present, the DepthInformationBox should not be present. If present, the parameters of DepthInformationBox shall be ignored.
2. If a Depth representation information SEI message, as specified in AVC [37], HEVC [38] or VVC/VSEI [36], with depth_representation_type equal to 0 or 2, and a DepthInformationBox are both present, the parameters of the DepthInformationBox should be set as follows:
a. If depth_representation_type = 0, then depth_mapping_type = 1, range_min = 0, and range_max = maxVal, with maxVal = 2^bit_depth - 1.
b. If depth_representation_type = 2, then depth_mapping_type = 0, range_min = 0, and range_max = maxVal, with maxVal = 2^bit_depth - 1.
c. If z_far_flag = 1, then near_plane = ZFar and units = unspecified.
d. If z_near_flag = 1, then far_plane = ZNear and units = unspecified.
3. If an Alpha channel information SEI message, as specified in HEVC [38] or VVC/VSEI [36], and an AlphaInformationBox are both present, the parameters of the AlphaInformationBox should be set as follows:
a. If alpha_channel_use_idc = 0, then premultiplication_mode = 0
b. If alpha_channel_use_idc = 1, then premultiplication_mode = 1 or 2
c. transparent_value = alpha_transparent_value
d. opaque_value = alpha_opaque_value
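Rules 1 and 2 above can be sketched as a helper that derives DepthInformationBox parameters from the values carried in a Depth representation information SEI message. The function name, its arguments and the returned dictionary are hypothetical; z_near and z_far stand for the ZNear and ZFar values and are passed as None when z_near_flag or z_far_flag is 0.

```python
def depth_box_params_from_sei(depth_representation_type, bit_depth,
                              z_near=None, z_far=None):
    """Derive DepthInformationBox parameters from SEI values (sketch)."""
    if depth_representation_type in (1, 3):
        # Rule 1: the box should not be present for these types.
        return None
    if depth_representation_type not in (0, 2):
        raise ValueError("unsupported depth_representation_type")

    params = {
        # Rule 2a: type 0 -> inverse mapping; rule 2b: type 2 -> linear mapping.
        "depth_mapping_type": 1 if depth_representation_type == 0 else 0,
        "range_min": 0,
        "range_max": (1 << bit_depth) - 1,  # maxVal = 2^bit_depth - 1
    }
    if z_far is not None:    # rule 2c: z_far_flag = 1
        params["near_plane"] = z_far
        params["units"] = 0  # unspecified
    if z_near is not None:   # rule 2d: z_near_flag = 1
        params["far_plane"] = z_near
        params["units"] = 0  # unspecified
    return params
```

Note that ZFar is assigned to near_plane and ZNear to far_plane, as stated in rules 2c and 2d: range_min = 0 is the sample value associated with near_plane.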
M.3 Lossless conversion between SEI-based and IEEE 754 32-bit floating-point formats
M.3.1 SEI-based floating-point format
The SEI-based format is the floating-point format specified by the depth representation information element syntax, provided in clause G.14.2.4.2 of HEVC [38] and given in Table M.1.
Table M.1 — Depth representation information element syntax, as defined in HEVC, clause G.14.2.4.2 [38]
depth_rep_info_element( OutSign, OutExp, OutMantissa, OutManLen ) | Descriptor |
da_sign_flag | u(1) |
da_exponent | u(7) |
da_mantissa_len_minus1 | u(5) |
da_mantissa | u(v) |
The three main fields of this floating-point format are defined as follows:
— Sign: 1 bit
— Exponent: 7 bits | Bias = 31 | Unbiased exponent values = [-30, 95]
— Mantissa: 1-32 bits
The length of the mantissa field (i.e., 1 to 32 bits) is determined by the da_mantissa_len_minus1 parameter.
M.3.2 IEEE 754 32-bit floating-point format
The IEEE 754 32-bit format is the single-precision floating-point format specified in IEEE 754 [39].
The three main fields of this floating-point format are defined as follows:
— Sign: 1 bit
— Exponent: 8 bits | Bias = 127 | Unbiased exponent values = [-126, 127]
— Mantissa: 23 bits
M.3.3 Lossless conversion rules
Two rules should be met for lossless conversion of the SEI-based floating-point format ($X$) to the IEEE 754 32-bit floating-point format ($Y$), within the common range of real (representable) numbers ($R$):
1. The mantissa length of $X$ must be set equal to 23, and
2. The unbiased exponent values of $X$ and $Y$ must be equal, according to Equation 3:
$$E_X - B_X = E_Y - B_Y \qquad (3)$$
where $E_X$ represents the unsigned integer value of the exponent field of $X$ and takes a value in the range 0 to 127, $B_X$ represents the bias specified for $X$ and is equal to 31, $E_Y$ represents the unsigned integer value of the exponent field of $Y$ and takes a value in the range 0 to 255, $B_Y$ represents the bias specified for $Y$ and is equal to 127, and $R$ is determined by the smallest negative and the largest positive real (representable) number of $X$.
NOTE 1 Special cases, such as $E_X$ equal to 0 or 127, require special handling.
NOTE 2 To respect all aforementioned definitions, $E_Y$ should take a value in the range 97 to 222.
NOTE 3 The smallest negative number represented by $X$ is $-(2 - 2^{-23}) \times 2^{95}$, and the largest positive number represented by $X$ is $(2 - 2^{-23}) \times 2^{95}$.
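The two rules can be illustrated with a minimal round-trip sketch. It assumes a 23-bit mantissa, exponent field values 1 to 126 in the SEI-based format (the values 0 and 127 are rejected as special cases), and an implicit leading one in both formats for that exponent range. The function and constant names are hypothetical.

```python
import struct

SEI_BIAS = 31     # bias of the 7-bit SEI-based exponent
IEEE_BIAS = 127   # bias of the 8-bit IEEE 754 single-precision exponent


def sei_to_ieee754(sign: int, exponent: int, mantissa: int) -> float:
    """Convert SEI-based fields (23-bit mantissa) to an IEEE 754 32-bit value."""
    if not 1 <= exponent <= 126:
        raise ValueError("exponent field values 0 and 127 need special handling")
    if not 0 <= mantissa < (1 << 23):
        raise ValueError("mantissa must fit in 23 bits")
    # Equal unbiased exponents: IEEE exponent field lands in 97..222.
    ieee_exponent = exponent - SEI_BIAS + IEEE_BIAS
    bits = ((sign & 1) << 31) | (ieee_exponent << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def ieee754_to_sei(value: float):
    """Convert a value in the common range back to (sign, exponent, mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    ieee_exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    exponent = ieee_exponent - IEEE_BIAS + SEI_BIAS
    if not 1 <= exponent <= 126:
        raise ValueError("value is outside the common representable range")
    return sign, exponent, mantissa
```

The mantissa bits are copied unchanged in both directions, which is why the conversion is lossless once the mantissa length is fixed at 23.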
M.4 Interpretation of depth decoded sample values
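Decoded sample values $d$ are interpreted as depth values $z$ that are computed based on the depth_mapping_type value, as specified in Table 15, and the near_plane, far_plane, range_min, and range_max values, according to Equation 4:
$$z = \begin{cases} \text{near\_plane} + t \, (\text{far\_plane} - \text{near\_plane}), & \text{depth\_mapping\_type} = 0 \\ \left[ \dfrac{1}{\text{near\_plane}} + t \left( \dfrac{1}{\text{far\_plane}} - \dfrac{1}{\text{near\_plane}} \right) \right]^{-1}, & \text{depth\_mapping\_type} = 1 \end{cases} \qquad (4)$$
where $t = (d - \text{range\_min}) / (\text{range\_max} - \text{range\_min})$.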
Indicative graphs representing Equation 4 are shown in Figure M.2. The left plot corresponds to a graph representing a linear relationship between the decoded sample values and the depth values, obtained for depth_mapping_type value equal to 0. The right plot corresponds to a graph representing an inverse relationship between the decoded sample values and the depth values, obtained for depth_mapping_type value equal to 1. In both graphs, it is assumed that the near_plane value is smaller than the far_plane value, and the near_plane and far_plane values are greater than 0.
Figure M.2 — Graphs obtained for depth_mapping_type value equal to 0 (left) and for depth_mapping_type value equal to 1 (right)
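The two mappings of Table 15 can be sketched as follows. The sketch reads the inverse relationship as a linear interpolation of the reciprocal depth between 1/near_plane and 1/far_plane, which is an interpretation of Table 15 rather than a normative formula. Returning None for out-of-range samples (which are to be ignored) and the function name are choices of this sketch.

```python
def depth_from_sample(d, depth_mapping_type, range_min, range_max,
                      near_plane, far_plane):
    """Map a decoded sample value d to a depth value (sketch)."""
    if d < range_min or d > range_max:
        return None  # samples outside the signalled range are ignored

    # Position of d within [range_min, range_max], from 0.0 to 1.0.
    t = (d - range_min) / (range_max - range_min)

    if depth_mapping_type == 0:
        # Linear: range_min -> near_plane, range_max -> far_plane.
        return near_plane + t * (far_plane - near_plane)

    # Inverse: range_min -> 1/near_plane, range_max -> 1/far_plane.
    inverse_depth = 1.0 / near_plane + t * (1.0 / far_plane - 1.0 / near_plane)
    return 1.0 / inverse_depth
```

For example, with range_min = 0, range_max = 255, near_plane = 2.0 and far_plane = 4.0, both mappings return 2.0 for d = 0 and 4.0 for d = 255, and differ only for intermediate sample values.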
M.5 Example depth information sample
A sample for the DepthInformationSampleEntry can be defined as follows:
class DepthInformationSample(){
float(32) near_plane;
float(32) far_plane;
}
9 Bibliography
Add the following references in the bibliography:
[36] ISO/IEC 23002-7, Information technology — MPEG video technologies — Part 7: Versatile supplemental enhancement information messages for coded video bitstreams
[37] ISO/IEC 14496-10, Information technology — Coding of audio-visual objects — Part 10: Advanced Video Coding (AVC)
[38] ISO/IEC 23008-2, Information technology — High efficiency coding and media delivery in heterogeneous environments — Part 2: High efficiency video coding (HEVC)
[39] IEEE Std 754-2019, IEEE Standard for Floating-Point Arithmetic
