Schema APIs

deeplake.Schema

The schema of a deeplake.Dataset.

__getitem__

__getitem__(column: str) -> ColumnDefinition

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns property

columns: list[ColumnDefinition]

A list of all columns within the schema
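
Example (a minimal sketch; it assumes the dataset exposes its schema as ds.schema and uses the placeholder path "ds_path", building the dataset from the TextEmbeddings template documented below):

>>> ds = deeplake.create("ds_path",
...     schema=deeplake.schemas.TextEmbeddings(768).build())
>>> schema = ds.schema                    # deeplake.Schema
>>> len(schema)                           # number of columns in the schema
>>> schema["embedding"]                   # look up a ColumnDefinition by name
>>> [col.name for col in schema.columns]  # names of all columns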

deeplake.SchemaView

A read-only view of a deeplake.Dataset's deeplake.Schema.

__getitem__

__getitem__(column: str) -> ColumnDefinitionView

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns property

columns: list[ColumnDefinitionView]

A list of all columns within the schema
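
Example (a sketch; it assumes that a dataset opened read-only, e.g. with deeplake.open_read_only, exposes a SchemaView through its schema property):

>>> ro = deeplake.open_read_only("ds_path")
>>> view = ro.schema                      # deeplake.SchemaView
>>> len(view)                             # number of columns in the schema
>>> view["embedding"].dtype               # ColumnDefinitionView: inspect only
>>> [col.name for col in view.columns]    # names of all columns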

deeplake.ColumnDefinition

drop

drop() -> None

Drops the column from the dataset.

dtype property

dtype: Type

The column datatype

name property

name: str

The name of the column

rename

rename(new_name: str) -> None

Renames the column

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| new_name | str | The new name for the column | required |
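
Example (a sketch using the placeholder path "ds_path"; whether a commit() is needed to persist the schema change is an assumption):

>>> ds = deeplake.create("ds_path",
...     schema=deeplake.schemas.TextEmbeddings(768).build())
>>> col = ds.schema["embedding"]          # ColumnDefinition
>>> col.name, col.dtype                   # inspect the column
>>> col.rename("text_embed")              # rename the column
>>> ds.schema["date_created"].drop()      # remove a column from the dataset
>>> ds.commit()                           # persist the schema changes (assumption)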

deeplake.ColumnDefinitionView

A read-only view of a deeplake.ColumnDefinition

dtype property

dtype: Type

The column datatype

name property

name: str

The name of the column

Default Schemas

COCOImages

COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> SchemaTemplate

A schema for storing COCO-based image data.

- id (uint64)
- image (jpg image)
- url (text)
- year (uint8)
- version (text)
- description (text)
- contributor (text)
- date_created (uint64)
- date_captured (uint64)
- embedding (embedding)
- license (text)
- is_crowd (bool)

If objects is true, the following fields are added:

- objects_bbox (bounding box)
- objects_classes (segment mask)

If keypoints is true, the following fields are added:

- keypoints_bbox (bounding box)
- keypoints_classes (segment mask)
- keypoints (2-dimensional array of uint32)
- keypoints_skeleton (2-dimensional array of uint16)

If stuffs is true, the following fields are added:

- stuffs_bbox (bounding box)
- stuffs_classes (segment mask)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| embedding_size | int | Size of the embeddings | required |
| quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed | False |
| objects | bool | If true, include the object detection fields listed above | True |
| keypoints | bool | If true, include the keypoint fields listed above | False |
| stuffs | bool | If true, include the stuff segmentation fields listed above | False |

Examples:

>>> # Create a dataset with the standard schema
>>> ds = deeplake.create("ds_path",
...     schema=deeplake.schemas.COCOImages(768).build())
>>> # Customize the schema before creating the dataset
>>> ds = deeplake.create("ds_path", schema=deeplake.schemas.COCOImages(768,
...         objects=True, keypoints=True)
...     .rename("embedding", "image_embed")
...     .add("author", deeplake.types.Text()).build())

SchemaTemplate

A template that can be used for creating a new dataset with deeplake.create

__init__

__init__(schema: dict[str, DataType | str | Type]) -> None

Constructs a new SchemaTemplate from the given dict
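
Example (a sketch; it assumes SchemaTemplate is exposed under deeplake.schemas and that plain dtype strings such as "uint64" and deeplake.types.Embedding(768) are valid values for the DataType | str | Type union above):

>>> template = deeplake.schemas.SchemaTemplate({
...     "id": "uint64",
...     "caption": deeplake.types.Text(),
...     "embedding": deeplake.types.Embedding(768),
... })
>>> ds = deeplake.create("ds_path", schema=template.build())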

add

add(
    name: str, dtype: DataType | str | Type
) -> SchemaTemplate

Adds a new column to the template

Parameters:

Name Type Description Default
name str

The column name

required
dtype DataType | str | Type

The column data type

required

remove

remove(name: str) -> SchemaTemplate

Removes a column from the template

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| name | str | The column name | required |

rename

rename(old_name: str, new_name: str) -> SchemaTemplate

Renames a column in the template.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| old_name | str | Existing column name | required |
| new_name | str | New column name | required |
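
Example (a sketch chaining the methods above on the TextEmbeddings template documented below; each call returns the template, so the result can be built in one expression):

>>> schema = (deeplake.schemas.TextEmbeddings(768)
...     .remove("date_created")                    # drop a column from the template
...     .rename("text_chunk", "text")              # rename an existing column
...     .add("source_url", deeplake.types.Text())  # add a new column
...     .build())
>>> ds = deeplake.create("ds_path", schema=schema)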

TextEmbeddings

TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> SchemaTemplate

A schema for storing embedded text from documents.

- id (uint64)
- chunk_index (uint16): position of the text_chunk within the document
- document_id (uint64): unique identifier for the document the embedding came from
- date_created (uint64): timestamp when the document was read
- text_chunk (text): the text of the chunk
- embedding (dtype=float32, size=embedding_size): the embedding of the text

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| embedding_size | int | Size of the embeddings | required |
| quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed | False |

Examples:

>>> # Create a dataset with the standard schema
>>> ds = deeplake.create("ds_path",
...         schema=deeplake.schemas.TextEmbeddings(768).build())
>>> # Customize the schema before creating the dataset
>>> ds = deeplake.create("ds_path", schema=deeplake.schemas.TextEmbeddings(768)
...         .rename("embedding", "text_embed")
...         .add("author", deeplake.types.Text())
...         .build())

Storage Formats

deeplake.formats.DataFormat

Base class for all datafile formats.

deeplake.formats.Chunk

Chunk(
    sample_compression: str | None = None,
    chunk_compression: str | None = None,
) -> DataFormat

Configures a "chunk" datafile format

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample_compression | str | How to compress individual values within the datafile | None |
| chunk_compression | str | How to compress the datafile as a whole | None |
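
Example (a sketch only; the compression names are illustrative and depend on what the installed deeplake build supports, and how the resulting DataFormat is attached to a column is not covered on this page):

>>> deeplake.formats.Chunk()              # default: no compression
>>> deeplake.formats.Chunk(
...     sample_compression="jpg",         # compress each value as JPEG
...     chunk_compression="lz4")          # compress the datafile with LZ4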