Deeplake Types

Deep Lake supports a wide variety of data types for your datasets.

When creating a new column, the data type can be defined in multiple ways, each of which converts to one of the data types below:

- Call one of the functions below directly, e.g. deeplake.types.Text()
- If the function takes no arguments, pass the function itself, e.g. deeplake.types.Text
- Pass a string containing the type name, e.g. "text"
- Pass a standard Python type, e.g. str
- Pass a NumPy type, e.g. np.str_

ds.add_column("col1", deeplake.types.Text())
ds.add_column("col2", deeplake.types.Text)
ds.add_column("col3", "text")
ds.add_column("col4", str)
ds.add_column("col5", np.str_)
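
To illustrate how several spellings can resolve to the same data type, here is a minimal sketch in plain Python. It does not use Deep Lake: `Text`, `PYTHON_TYPE_NAMES`, and `normalize_type_spec` are hypothetical stand-ins invented for this example, and the real conversion logic may differ.

```python
# Hypothetical helper, not part of the deeplake API: it only illustrates
# how several spellings of a type can collapse to one canonical name.

class Text:
    """Stand-in for a type constructor such as deeplake.types.Text."""
    type_name = "text"

# Assumed mapping from standard Python types to canonical type names.
PYTHON_TYPE_NAMES = {str: "text", bool: "bool"}

def normalize_type_spec(spec):
    """Return the canonical type name for any supported spelling."""
    if isinstance(spec, str):          # a type name such as "text"
        return spec.lower()
    if spec in PYTHON_TYPE_NAMES:      # a standard Python type such as str
        return PYTHON_TYPE_NAMES[spec]
    if isinstance(spec, type):         # the constructor itself, e.g. Text
        spec = spec()                  # call it with no arguments
    return spec.type_name              # a constructed instance, e.g. Text()

specs = ["text", str, Text, Text()]
names = [normalize_type_spec(s) for s in specs]
print(names)
```

All four spellings end up as the same canonical name, which is the behavior the list above describes.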

All Data Types

Note

For simplicity, all samples assume the following setup code:

import deeplake
import numpy as np
from deeplake import types

ds = deeplake.create("mem://test")

deeplake.types.Array

Array(dtype: DataType | str, dimensions: int) -> DataType

Array(dtype: DataType | str, shape: list[int]) -> DataType

Array(
    dtype: DataType | str, dimensions: int, shape: list[int]
) -> DataType

A generic array of data.

Parameters:

Name	Type	Description	Default
`dtype`	`DataType \| str`	The datatype of values in the array	required
`dimensions`	`int`	The number of dimensions/axes in the array. Unlike specifying `shape`, there is no constraint on the size of each dimension.	required
`shape`	`list[int]`	Constrain the size of each dimension in the array	required
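
The difference between `dimensions` and `shape` can be sketched in plain Python. The helpers `value_shape` and `fits` below are hypothetical and only model the constraint described in the table. They are not how Deep Lake validates data.

```python
def value_shape(value):
    """Return the shape of a regularly nested list, e.g. [[1, 2], [3, 4]] -> [2, 2]."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return shape

def fits(value, dimensions=None, shape=None):
    """Check a nested list against a dimensions and/or shape constraint."""
    actual = value_shape(value)
    # `dimensions` only fixes the number of axes.
    if dimensions is not None and len(actual) != dimensions:
        return False
    # `shape` also fixes the size of every axis.
    if shape is not None and actual != list(shape):
        return False
    return True

matrix = [[1, 2, 3], [4, 5, 6]]            # 2 rows, 3 columns
print(fits(matrix, dimensions=2))          # any 2-D value is accepted
print(fits(matrix, shape=[2, 3]))          # exact sizes match
print(fits(matrix, shape=[3, 2]))          # same axis count, wrong sizes
```

A `dimensions` constraint accepts any value with the right number of axes, while a `shape` constraint additionally pins the size of each axis.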

Examples:

>>> # Create a three-dimensional array, where each dimension can have any number of elements
>>> ds.add_column("col1", types.Array("int32", dimensions=3))
>>>
>>> # Create a three-dimensional array, where each dimension has a known size
>>> ds.add_column("col2", types.Array(types.Float32(), shape=[50, 30, 768]))

deeplake.types.Binary `module-attribute`

Binary: QuantizationType

deeplake.types.BinaryMask

BinaryMask(
    sample_compression: str | None = None,
    chunk_compression: str | None = None,
) -> Type

In a binary mask, each pixel value is a boolean indicating whether an object of a given class is present.

NOTE: Since binary masks often contain large amounts of data, it is recommended to compress them using lz4.

Parameters:

Name	Type	Description	Default
`sample_compression`	`str \| None`	How to compress each row's value. Possible values: lz4, null (default: null)	`None`
`chunk_compression`	`str \| None`	How to compress all the values stored in a single file. Possible values: lz4, null (default: null)	`None`
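
To see why compression pays off for masks, the following sketch compresses an empty 512x512 mask. It uses `zlib` from the Python standard library purely as a stand-in, because lz4 is not in the standard library. The exact ratio with lz4 will differ, but the effect on highly repetitive data is similar in kind.

```python
import zlib

height, width = 512, 512

# One byte per pixel, all zeros: an empty mask with no object present.
raw = bytes(height * width)

# zlib is used here only as a stand-in for lz4.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

print(len(raw), "bytes uncompressed")
print(len(compressed), "bytes compressed")
```

Masks usually contain large uniform regions, which is the kind of data that general-purpose compressors shrink very effectively.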

Examples:

>>> ds.add_column("col1", types.BinaryMask(sample_compression="lz4"))
>>> ds.append([{"col1": np.zeros((512, 512, 5), dtype="bool")}])

deeplake.types.Bool

Bool() -> DataType

A boolean value

Examples:

>>> ds.add_column("col1", types.Bool)
>>> ds.add_column("col2", "bool")

deeplake.types.BoundingBox

BoundingBox(
    dtype: DataType | str = "float32",
    format: str | None = None,
    bbox_type: str | None = None,
) -> Type

Stores an array of values specifying the bounding boxes of an image.

Parameters:

Name	Type	Description	Default
`dtype`	`DataType \| str`	The datatype of values (default float32)	`'float32'`
`format`	`str \| None`	The bounding box format. Possible values: `ccwh`, `tlwh`, `tlbr`, `unknown`	`None`
`bbox_type`	`str \| None`	The pixel type. Possible values: `pixel`, `fractional`	`None`
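
The format names are not spelled out here, so the sketch below rests on an assumed reading: `ccwh` as center x, center y, width, height; `tlwh` as top-left x, top-left y, width, height; `tlbr` as top-left and bottom-right corners; and `fractional` as pixel coordinates divided by the image size. The conversion helpers are hypothetical and not part of Deep Lake.

```python
def ccwh_to_tlwh(box):
    """Convert [center_x, center_y, width, height] to [left, top, width, height]."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, w, h]

def tlwh_to_tlbr(box):
    """Convert [left, top, width, height] to [left, top, right, bottom]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def pixel_to_fractional(box, image_width, image_height):
    """Scale pixel coordinates into the 0..1 range.

    Works for all three formats because positions 0 and 2 are horizontal
    quantities and positions 1 and 3 are vertical quantities.
    """
    return [
        box[0] / image_width,
        box[1] / image_height,
        box[2] / image_width,
        box[3] / image_height,
    ]

box_ccwh = [50, 40, 20, 10]
box_tlwh = ccwh_to_tlwh(box_ccwh)
box_tlbr = tlwh_to_tlbr(box_tlwh)
box_frac = pixel_to_fractional(box_tlbr, image_width=200, image_height=100)
print(box_tlwh, box_tlbr, box_frac)
```

Declaring `format` and `bbox_type` on the column records which of these conventions the stored numbers follow, so consumers do not have to guess.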

Examples:

>>> ds.add_column("col1", types.BoundingBox())
>>> ds.add_column("col2", types.BoundingBox(format="tlwh"))

deeplake.types.Dict

Dict() -> Type

Supports storing arbitrary key/value pairs in each row.

See deeplake.types.Struct for a type that supports defining allowed keys.

Examples:

>>> ds.add_column("col1", types.Dict)
>>>
>>> ds.append([{"col1": {"a": 1, "b": 2}}])
>>> ds.append([{"col1": {"b": 3, "c": 4}}])

deeplake.types.Embedding

Embedding(
    size: int,
    dtype: DataType | str = "float32",
    quantization: QuantizationType | None = None,
) -> Type

A single-dimensional embedding of a given length. See deeplake.types.Array for a multidimensional array.

Parameters:

Name	Type	Description	Default
`size`	`int`	The size of the embedding	required
`dtype`	`DataType \| str`	The datatype of the embedding. Defaults to float32	`'float32'`
`quantization`	`QuantizationType \| None`	How to compress the embeddings in the index. Default uses no compression, but can be set to deeplake.types.QuantizationType.Binary	`None`

Examples:

>>> ds.add_column("col1", types.Embedding(768))
>>> ds.add_column("col2", types.Embedding(768, quantization=types.QuantizationType.Binary))

deeplake.types.Float32

Float32() -> DataType

A 32-bit float value

Examples:

>>> ds.add_column("col1", types.Float32)

deeplake.types.Float64

Float64() -> DataType

A 64-bit float value

Examples:

>>> ds.add_column("col1", types.Float64)

deeplake.types.Image

Image(
    dtype: DataType | str = "uint8",
    sample_compression: str = "png",
) -> Type

An image of a given format. The value returned will be a multidimensional array of values rather than the raw image bytes.

Available formats:

png (default)
apng
jpg / jpeg
tiff / tif
jpeg2000 / jp2
bmp
nii
nii.gz
dcm

Parameters:

Name	Type	Description	Default
`dtype`	`DataType \| str`	The data type of the array elements to return	`'uint8'`
`sample_compression`	`str`	The on-disk compression/format of the image	`'png'`

Examples:

>>> ds.add_column("col1", types.Sequence(types.Image))
>>> ds.add_column("col2", types.Sequence(types.Image(sample_compression="jpg")))

deeplake.types.Int16

Int16() -> DataType

A 16-bit integer value

Examples:

>>> ds.add_column("col1", types.Int16)

deeplake.types.Int32

Int32() -> DataType

A 32-bit integer value

Examples:

>>> ds.add_column("col1", types.Int32)

deeplake.types.Int64

Int64() -> DataType

A 64-bit integer value

Examples:

>>> ds.add_column("col1", types.Int64)

deeplake.types.Int8

Int8() -> DataType

An 8-bit integer value

Examples:

>>> ds.add_column("col1", types.Int8)

deeplake.types.SegmentMask

SegmentMask(
    dtype: DataType | str = "uint8",
    sample_compression: str | None = None,
    chunk_compression: str | None = None,
) -> Type

Segmentation masks are 2D representations of class labels, where a numerical class value is encoded in an array of the same shape as the image.

NOTE: Since segmentation masks often contain large amounts of data, it is recommended to compress them using lz4.
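
As a rough illustration of how a segmentation mask relates to binary masks, the sketch below splits one label array into one boolean mask per class. It uses nested lists instead of NumPy arrays, and `to_binary_masks` is a hypothetical helper. The class-first layout is an arbitrary choice for readability.

```python
def to_binary_masks(segment_mask, num_classes):
    """Return one boolean height x width mask per class label."""
    return [
        [[pixel == class_id for pixel in row] for row in segment_mask]
        for class_id in range(num_classes)
    ]

# A 2x3 image whose pixels are labelled with class ids 0, 1 and 2.
segment_mask = [
    [0, 0, 1],
    [2, 1, 1],
]

binary_masks = to_binary_masks(segment_mask, num_classes=3)
for class_id, mask in enumerate(binary_masks):
    print(class_id, mask)
```

A segmentation mask stores a single class id per pixel, so it stays two-dimensional no matter how many classes there are.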

Parameters:

Name	Type	Description	Default
`dtype`	`DataType \| str`	The datatype of the values in the mask	`'uint8'`
`sample_compression`	`str \| None`	How to compress each row's value. Possible values: lz4, null (default: null)	`None`
`chunk_compression`	`str \| None`	How to compress all the values stored in a single file. Possible values: lz4, null (default: null)	`None`

Examples:

>>> ds.add_column("col1", types.SegmentMask(sample_compression="lz4"))
>>> ds.append([{"col1": np.zeros((512, 512), dtype="uint8")}])

deeplake.types.Sequence

Sequence(nested_type: DataType | str | Type) -> Type

A sequence is a list of values of another data type, where the order of the values is significant.

For example, a video can be stored as a sequence of images to better capture the time-based ordering of the images, rather than simply storing them as an Array.

Parameters:

Name	Type	Description	Default
`nested_type`	`DataType \| str \| Type`	The data type of the values in the sequence. Can be any data type, not just primitive types.	required

Examples:

>>> ds.add_column("col1", types.Sequence(types.Image(sample_compression="jpeg")))

deeplake.types.Struct

Struct(fields: dict[str, DataType | str]) -> DataType

Defines a custom datatype with specified keys.

See deeplake.types.Dict for a type that supports different key/value pairs per value.

Parameters:

Name	Type	Description	Default
`fields`	`dict[str, DataType \| str]`	A dict where the key is the name of the field, and the value is the datatype definition for it	required

Examples:

>>> ds.add_column("col1", types.Struct({
...    "field1": types.Int16(),
...    "field2": types.Text(),
... }))
>>>
>>> ds.append([{"col1": {"field1": 3, "field2": "a"}}])
>>> print(ds[0]["col1"]["field1"])

deeplake.types.Text

Text(index_type: str | TextIndexType | None = None) -> Type

Text data of arbitrary length.

Options for index_type are:

deeplake.types.Inverted
deeplake.types.BM25

Parameters:

Name	Type	Description	Default
`index_type`	`str \| TextIndexType \| None`	How to index the data in the column for faster searching. Default is `None` meaning "do not index"	`None`

Examples:

>>> ds.add_column("col1", types.Text)
>>> ds.add_column("col2", "text")
>>> ds.add_column("col3", str)
>>> ds.add_column("col4", types.Text(index_type=types.Inverted))
>>> ds.add_column("col5", types.Text(index_type=types.BM25))

deeplake.types.UInt16

UInt16() -> DataType

An unsigned 16-bit integer value

Examples:

>>> ds.add_column("col1", types.UInt16)

deeplake.types.UInt32

UInt32() -> DataType

An unsigned 32-bit integer value

Examples:

>>> ds.add_column("col1", types.UInt32)

deeplake.types.UInt64

UInt64() -> DataType

An unsigned 64-bit integer value

Examples:

>>> ds.add_column("col1", types.UInt64)

deeplake.types.UInt8

UInt8() -> DataType

An unsigned 8-bit integer value

Examples:

>>> ds.add_column("col1", types.UInt8)

Text Index Types

deeplake.types.BM25 `module-attribute`

BM25: TextIndexType

A BM25 based index of text data.

This index can be used with BM25_SIMILARITY(column, 'search text') in a TQL ORDER BY clause.
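
For readers unfamiliar with BM25, here is a minimal, self-contained sketch of the ranking function in plain Python. It is not Deep Lake's implementation: whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the helper name `bm25_scores` are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            tf = term_freq[term]
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

documents = ["deep lake stores tensors", "a lake of data", "unrelated text"]
scores = bm25_scores("deep lake", documents)
print(scores)
```

Sorting rows by such a score in descending order is the kind of ranking an ORDER BY on a similarity function produces.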

deeplake.types.Inverted `module-attribute`

Inverted: TextIndexType

A text index that supports keyword lookup.

This index can be used with CONTAINS(column, 'wanted_value').
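
Conceptually, an inverted index maps each keyword to the rows that contain it, so a lookup does not need to scan every row. The sketch below shows the idea in plain Python. The helpers `build_inverted_index` and `contains` are hypothetical, and splitting on whitespace is an assumed tokenization.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each lowercase token to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in text.lower().split():
            index[token].add(row_id)
    return index

def contains(index, wanted_value):
    """Return the sorted ids of rows containing the keyword."""
    return sorted(index.get(wanted_value.lower(), set()))

rows = ["red apple", "green apple", "red car"]
index = build_inverted_index(rows)

print(contains(index, "red"))
print(contains(index, "apple"))
print(contains(index, "blue"))
```

The index is built once, and each keyword lookup afterwards is a single dictionary access.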

Embedding Quantization

deeplake.types.QuantizationType.Binary `class-attribute`

Binary: QuantizationType

Stores a binary quantized representation of the original embedding in the index rather than a full copy of the embedding.

This slightly decreases accuracy of searches, while significantly improving query time.
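
The trade-off can be illustrated with a common form of binary quantization: keep one bit per component (its sign) and compare vectors by Hamming distance. This is a sketch of the general technique under that assumption, not a description of how Deep Lake quantizes embeddings. The helper names are hypothetical.

```python
def binary_quantize(vector):
    """Keep only the sign of each component: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]

def hamming_distance(a, b):
    """Count the positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

query = [0.9, -0.2, 0.4, -0.7]
near = [0.8, -0.1, 0.5, -0.6]    # close to the query
far = [-0.9, 0.3, -0.4, 0.6]     # points in the opposite direction

q_bits = binary_quantize(query)
near_bits = binary_quantize(near)
far_bits = binary_quantize(far)

print(hamming_distance(q_bits, near_bits))
print(hamming_distance(q_bits, far_bits))
```

Comparing bits is much cheaper than comparing floats, which is where the speedup comes from. The accuracy loss is also visible: `query` and `near` are different vectors but quantize to identical bits, so the index alone cannot tell them apart.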

Base Classes

deeplake.types.DataType

The base class all specific types extend from.

deeplake.types.Type

data_type `property`

data_type: DataType

default_format `property`

default_format: DataFormat

id `property`

id: str

The id (name) of the data type

is_sequence `property`

is_sequence: bool

kind `property`

kind: TypeKind

shape `property`

shape: list[int] | None

The shape of the data type, if applicable. Otherwise `None`.

deeplake.types.TextIndexType

Members:

Inverted

BM25

name `property`

name: str

value `property`

value: int

deeplake.types.QuantizationType

name `property`

name: str

value `property`

value: int
