Schema APIs

deeplake.Schema

The schema of a deeplake.Dataset.

__getitem__

__getitem__(column: str) -> ColumnDefinition

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns property

columns: list[ColumnDefinition]

A list of all columns within the schema
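
Example (a minimal sketch; it assumes the dataset exposes its schema as ds.schema and uses the placeholder path "ds_path", building the dataset from the TextEmbeddings template documented below):

>>> ds = deeplake.create("ds_path",
...     schema=deeplake.schemas.TextEmbeddings(768).build())
>>> schema = ds.schema                    # deeplake.Schema
>>> len(schema)                           # number of columns in the schema
>>> schema["embedding"]                   # look up a ColumnDefinition by name
>>> [col.name for col in schema.columns]  # names of all columns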

deeplake.SchemaView

A read-only view of a deeplake.Dataset's deeplake.Schema.

__getitem__

__getitem__(column: str) -> ColumnDefinitionView

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns property

columns: list[ColumnDefinitionView]

A list of all columns within the schema
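
Example (a sketch; it assumes that a dataset opened read-only, e.g. with deeplake.open_read_only, exposes a SchemaView through its schema property):

>>> ro = deeplake.open_read_only("ds_path")
>>> view = ro.schema                      # deeplake.SchemaView
>>> len(view)                             # number of columns in the schema
>>> view["embedding"].dtype               # ColumnDefinitionView: inspect only
>>> [col.name for col in view.columns]    # names of all columns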

deeplake.ColumnDefinition

drop

drop() -> None

Drops the column from the dataset.

dtype property

dtype: Type

The column datatype

name property

name: str

The name of the column

rename

rename(new_name: str) -> None

Renames the column

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| new_name | str | The new name for the column | required |
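
Example (a sketch using the placeholder path "ds_path"; whether a commit() is needed to persist the schema change is an assumption):

>>> ds = deeplake.create("ds_path",
...     schema=deeplake.schemas.TextEmbeddings(768).build())
>>> col = ds.schema["embedding"]          # ColumnDefinition
>>> col.name, col.dtype                   # inspect the column
>>> col.rename("text_embed")              # rename the column
>>> ds.schema["date_created"].drop()      # remove a column from the dataset
>>> ds.commit()                           # persist the schema changes (assumption)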

deeplake.ColumnDefinitionView

A read-only view of a deeplake.ColumnDefinition

dtype property

dtype: Type

The column datatype

name property

name: str

The name of the column

Default Schemas

COCOImages

COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> SchemaTemplate

A schema for storing COCO-based image data.

- id (uint64)
- image (jpg image)
- url (text)
- year (uint8)
- version (text)
- description (text)
- contributor (text)
- date_created (uint64)
- date_captured (uint64)
- embedding (embedding)
- license (text)
- is_crowd (bool)

If objects is true, the following fields are added:

- objects_bbox (bounding box)
- objects_classes (segment mask)

If keypoints is true, the following fields are added:

- keypoints_bbox (bounding box)
- keypoints_classes (segment mask)
- keypoints (2-dimensional array of uint32)
- keypoints_skeleton (2-dimensional array of uint16)

If stuffs is true, the following fields are added:

- stuffs_bbox (bounding box)
- stuffs_classes (segment mask)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| embedding_size | int | Size of the embeddings | required |
| quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed | False |
| objects | bool | If true, include the object detection fields listed above | True |
| keypoints | bool | If true, include the keypoint fields listed above | False |
| stuffs | bool | If true, include the stuff segmentation fields listed above | False |

Examples:

>>> # Create a dataset with the standard schema
>>> ds = deeplake.create("ds_path",
...     schema=deeplake.schemas.COCOImages(768).build())
>>> # Customize the schema before creating the dataset
>>> ds = deeplake.create("ds_path", schema=deeplake.schemas.COCOImages(768,
...         objects=True, keypoints=True)
...     .rename("embedding", "image_embed")
...     .add("author", deeplake.types.Text()).build())

SchemaTemplate

A template that can be used for creating a new dataset with deeplake.create

__init__

__init__(schema: dict[str, DataType | str | Type]) -> None

Constructs a new SchemaTemplate from the given dict
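
Example (a sketch; it assumes SchemaTemplate is exposed under deeplake.schemas and that plain dtype strings such as "uint64" and deeplake.types.Embedding(768) are valid values for the DataType | str | Type union above):

>>> template = deeplake.schemas.SchemaTemplate({
...     "id": "uint64",
...     "caption": deeplake.types.Text(),
...     "embedding": deeplake.types.Embedding(768),
... })
>>> ds = deeplake.create("ds_path", schema=template.build())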

add

add(
    name: str, dtype: DataType | str | Type
) -> SchemaTemplate

Adds a new column to the template

Parameters:

Name Type Description Default
name str

The column name

required
dtype DataType | str | Type

The column data type

required

remove

remove(name: str) -> SchemaTemplate

Removes a column from the template

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | The column name | required |

rename

rename(old_name: str, new_name: str) -> SchemaTemplate

Renames a column in the template.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| old_name | str | Existing column name | required |
| new_name | str | New column name | required |
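
Example (a sketch chaining the methods above on the TextEmbeddings template documented below; each call returns the template, so the result can be built in one expression):

>>> schema = (deeplake.schemas.TextEmbeddings(768)
...     .remove("date_created")                    # drop a column from the template
...     .rename("text_chunk", "text")              # rename an existing column
...     .add("source_url", deeplake.types.Text())  # add a new column
...     .build())
>>> ds = deeplake.create("ds_path", schema=schema)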

TextEmbeddings

TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> SchemaTemplate

A schema for storing embedded text from documents.

- id (uint64)
- chunk_index (uint16): position of the text_chunk within the document
- document_id (uint64): unique identifier for the document the embedding came from
- date_created (uint64): timestamp when the document was read
- text_chunk (text): the text of the chunk
- embedding (dtype=float32, size=embedding_size): the embedding of the text

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| embedding_size | int | Size of the embeddings | required |
| quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed | False |

Examples:

>>> # Create a dataset with the standard schema
>>> ds = deeplake.create("ds_path",
...         schema=deeplake.schemas.TextEmbeddings(768).build())
>>> # Customize the schema before creating the dataset
>>> ds = deeplake.create("ds_path", schema=deeplake.schemas.TextEmbeddings(768)
...         .rename("embedding", "text_embed")
...         .add("author", deeplake.types.Text())
...         .build())

Storage Formats

deeplake.formats.DataFormat

Base class for all datafile formats.

deeplake.formats.Chunk

Chunk(
    sample_compression: str | None = None,
    chunk_compression: str | None = None,
) -> DataFormat

Configures a "chunk" datafile format

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample_compression | str | How to compress individual values within the datafile | None |
| chunk_compression | str | How to compress the datafile as a whole | None |
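
Example (a sketch only; the compression names are illustrative and depend on what the installed deeplake build supports, and how the resulting DataFormat is attached to a column is not covered on this page):

>>> deeplake.formats.Chunk()              # default: no compression
>>> deeplake.formats.Chunk(
...     sample_compression="jpg",         # compress each value as JPEG
...     chunk_compression="lz4")          # compress the datafile with LZ4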