Schema APIs

deeplake.Schema

The schema of a deeplake.Dataset.

__getitem__

__getitem__(column: str) -> ColumnDefinition

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns property

columns: list[ColumnDefinition]

A list of all columns within the schema

deeplake.SchemaView

A read-only view of a deeplake.Dataset's deeplake.Schema.

__getitem__

__getitem__(column: str) -> ColumnDefinitionView

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns property

columns: list[ColumnDefinitionView]

A list of all columns within the schema

deeplake.ColumnDefinition

drop

drop() -> None

Drops the column from the dataset.

dtype property

dtype: Type

The column datatype

name property

name: str

The name of the column

rename

rename(new_name: str) -> None

Renames the column

Parameters:

new_name (str, required): The new name for the column.

deeplake.ColumnDefinitionView

A read-only view of a deeplake.ColumnDefinition.

dtype property

dtype: Type

The column datatype

name property

name: str

The name of the column

Default Schemas

COCOImages

COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> SchemaTemplate

A schema for storing COCO-based image data.

This schema includes the following fields:

- id (uint64): Unique identifier for each entry.
- image (jpg image): The image data.
- url (text): URL of the image.
- year (uint8): Year the image was captured.
- version (text): Version of the dataset.
- description (text): Description of the image.
- contributor (text): Contributor of the image.
- date_created (uint64): Timestamp when the image was created.
- date_captured (uint64): Timestamp when the image was captured.
- embedding (embedding): Embedding of the image.
- license (text): License information.
- is_crowd (bool): Whether the image contains a crowd.

If objects is true, the following fields are added:

- objects_bbox (bounding box): Bounding boxes for objects.
- objects_classes (segment mask): Segment masks for objects.

If keypoints is true, the following fields are added:

- keypoints_bbox (bounding box): Bounding boxes for keypoints.
- keypoints_classes (segment mask): Segment masks for keypoints.
- keypoints (2-dimensional array of uint32): Keypoints data.
- keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.

If stuffs is true, the following fields are added:

- stuffs_bbox (bounding box): Bounding boxes for stuffs.
- stuffs_classes (segment mask): Segment masks for stuffs.

Parameters:

embedding_size (int, required): Size of the embeddings.
quantize (bool, optional): If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. Default is False.
objects (bool, optional): Whether to include object-related fields. Default is True.
keypoints (bool, optional): Whether to include keypoint-related fields. Default is False.
stuffs (bool, optional): Whether to include stuff-related fields. Default is False.

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768))

Customize the schema before creating the dataset:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768, objects=True, keypoints=True)
    .rename("embedding", "image_embed")
    .add("author", types.Text()))

Add a new field to the schema:

schema = deeplake.schemas.COCOImages(768)
schema.add("location", types.Text())
ds = deeplake.create("tmp://", schema=schema)

SchemaTemplate

A template that can be used for creating a new dataset with deeplake.create.

This class allows you to define and customize the schema for your dataset.

Parameters:

schema (dict[str, DataType | str | Type], required): A dictionary where the key is the column name and the value is the data type.

Methods:

add(name: str, dtype: DataType | str | Type) -> SchemaTemplate: Adds a new column to the template.

remove(name: str) -> SchemaTemplate: Removes a column from the template.

rename(old_name: str, new_name: str) -> SchemaTemplate: Renames a column in the template.

Examples:

Create a new schema template, modify it, and create a dataset with the schema:

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.add("author", types.Text())
schema.remove("text")
schema.rename("embedding", "text_embedding")
ds = deeplake.create("tmp://", schema=schema)

__init__

__init__(schema: dict[str, DataType | str | Type]) -> None

Constructs a new SchemaTemplate from the given dict.

add

add(
    name: str, dtype: DataType | str | Type
) -> SchemaTemplate

Adds a new column to the template.

Parameters:

name (str, required): The column name.
dtype (DataType | str | Type, required): The column data type.

Returns:

SchemaTemplate: The updated schema template.

Examples:

Add a new column to the schema:

schema = deeplake.schemas.SchemaTemplate({})
schema.add("author", types.Text())

remove

remove(name: str) -> SchemaTemplate

Removes a column from the template.

Parameters:

name (str, required): The column name.

Returns:

SchemaTemplate: The updated schema template.

Examples:

Remove a column from the schema:

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.remove("text")

rename

rename(old_name: str, new_name: str) -> SchemaTemplate

Renames a column in the template.

Parameters:

old_name (str, required): Existing column name.
new_name (str, required): New column name.

Returns:

SchemaTemplate: The updated schema template.

Examples:

Rename a column in the schema:

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.rename("embedding", "text_embedding")

TextEmbeddings

TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> SchemaTemplate

A schema for storing embedded text from documents.

This schema includes the following fields:

- id (uint64): Unique identifier for each entry.
- chunk_index (uint16): Position of the text chunk within the document.
- document_id (uint64): Unique identifier for the document the embedding came from.
- date_created (uint64): Timestamp when the document was read.
- text_chunk (text): The text of the chunk.
- embedding (dtype=float32, size=embedding_size): The embedding of the text.

Parameters:

embedding_size (int, required): Size of the embeddings.
quantize (bool, optional): If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. Default is False.

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768))

Customize the schema before creating the dataset:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768)
    .rename("embedding", "text_embed")
    .add("author", types.Text()))

Add a new field to the schema:

schema = deeplake.schemas.TextEmbeddings(768)
schema.add("language", types.Text())
ds = deeplake.create("tmp://", schema=schema)

Storage Formats

deeplake.formats.DataFormat

Base class for all datafile formats.

deeplake.formats.Chunk

Chunk(
    sample_compression: str | None = None,
    chunk_compression: str | None = None,
) -> DataFormat

Configures a "chunk" datafile format.

Parameters:

sample_compression (str | None, optional): How to compress individual values within the datafile. Default is None.
chunk_compression (str | None, optional): How to compress the datafile as a whole. Default is None.