Schema APIs

deeplake.Schema

__getitem__

__getitem__(column: str) -> ColumnDefinition

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns `property`

columns: list[ColumnDefinition]

A list of all columns within the schema

deeplake.SchemaView

A read-only view of the deeplake.Schema of a deeplake.Dataset.

__getitem__

__getitem__(column: str) -> ColumnDefinitionView

Return the column definition by name

__len__

__len__() -> int

The number of columns within the schema

columns `property`

columns: list[ColumnDefinitionView]

A list of all columns within the schema
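The members documented above (`__getitem__`, `__len__`, `columns`, and each column's `name` and `dtype`) are enough to write small inspection helpers that work on either a Schema or a SchemaView. The following is a minimal sketch under assumptions: `summarize_schema`, `has_column`, `StandInColumn`, and `StandInSchema` are hypothetical names introduced here, and the stand-in classes exist only so the example runs without a dataset. With deeplake installed, the schema object of a real dataset would be passed instead.

```python
def summarize_schema(schema):
    """Map each column name to its dtype, rendered as text."""
    # Uses only the documented `columns`, `name`, and `dtype` members.
    return {col.name: str(col.dtype) for col in schema.columns}


def has_column(schema, name):
    """Check membership through `columns`.

    This avoids relying on how `schema[name]` reacts to an unknown name,
    which this reference does not specify.
    """
    return any(col.name == name for col in schema.columns)


class StandInColumn:
    """Hypothetical stand-in exposing the documented `name` and `dtype`."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype


class StandInSchema:
    """Hypothetical stand-in exposing `columns`, `len()`, and indexing."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __len__(self):
        return len(self.columns)

    def __getitem__(self, column):
        for col in self.columns:
            if col.name == column:
                return col
        raise KeyError(column)  # stand-in behaviour only


demo = StandInSchema([StandInColumn("id", "uint64"), StandInColumn("text", "text")])
print(len(demo))
print(demo["text"].dtype)
print(summarize_schema(demo))
print(has_column(demo, "author"))
```

Checking membership through `columns` rather than indexing keeps the helper independent of the error raised for a missing column.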

deeplake.ColumnDefinition

drop

drop() -> None

Drops the column from the dataset.

dtype `property`

dtype: Type

The column datatype

name `property`

name: str

The name of the column

rename

rename(new_name: str) -> None

Renames the column

Parameters:

Name	Type	Description	Default
`new_name`	`str`	The new name for the column	required
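Because `rename` and `drop` are called on a column definition obtained from the schema, bulk changes can be written as a loop over column names. The sketch below lowercases every column name. It is illustrative only: `lowercase_column_names`, `StandInColumn`, and `StandInSchema` are hypothetical names, and the stand-in classes merely imitate the documented members so the example runs without a dataset.

```python
def lowercase_column_names(schema):
    """Rename columns to lowercase and return {old name: new name}."""
    # Snapshot the names first so renaming does not disturb iteration.
    names = [col.name for col in schema.columns]
    taken = set(names)
    renamed = {}
    for name in names:
        target = name.lower()
        if target == name or target in taken:
            continue  # already lowercase, or the new name would collide
        schema[name].rename(target)  # documented: rename(new_name) -> None
        taken.discard(name)
        taken.add(target)
        renamed[name] = target
    return renamed


class StandInColumn:
    """Hypothetical stand-in for a column definition."""

    def __init__(self, owner, name, dtype):
        self._owner = owner
        self.name = name
        self.dtype = dtype

    def rename(self, new_name):
        self.name = new_name

    def drop(self):
        self._owner.columns.remove(self)


class StandInSchema:
    """Hypothetical stand-in for a schema."""

    def __init__(self, spec):
        self.columns = [StandInColumn(self, n, t) for n, t in spec.items()]

    def __len__(self):
        return len(self.columns)

    def __getitem__(self, column):
        for col in self.columns:
            if col.name == column:
                return col
        raise KeyError(column)  # stand-in behaviour only


demo = StandInSchema({"ID": "uint64", "Text": "text", "embedding": "embedding"})
changes = lowercase_column_names(demo)
demo["text"].drop()
print(changes, [col.name for col in demo.columns])
```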

deeplake.ColumnDefinitionView

A read-only view of a deeplake.ColumnDefinition.

dtype `property`

dtype: Type

The column datatype

name `property`

name: str

The name of the column

Default Schemas

COCOImages

COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> SchemaTemplate

A schema for storing COCO-based image data.

This schema includes the following fields:

- id (uint64): Unique identifier for each entry.
- image (jpg image): The image data.
- url (text): URL of the image.
- year (uint8): Year the image was captured.
- version (text): Version of the dataset.
- description (text): Description of the image.
- contributor (text): Contributor of the image.
- date_created (uint64): Timestamp when the image was created.
- date_captured (uint64): Timestamp when the image was captured.
- embedding (embedding): Embedding of the image.
- license (text): License information.
- is_crowd (bool): Whether the image contains a crowd.

If objects is true, the following fields are added:

- objects_bbox (bounding box): Bounding boxes for objects.
- objects_classes (segment mask): Segment masks for objects.

If keypoints is true, the following fields are added:

- keypoints_bbox (bounding box): Bounding boxes for keypoints.
- keypoints_classes (segment mask): Segment masks for keypoints.
- keypoints (2-dimensional array of uint32): Keypoints data.
- keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.

If stuffs is true, the following fields are added:

- stuffs_bbox (bounding boxes): Bounding boxes for stuffs.
- stuffs_classes (segment mask): Segment masks for stuffs.
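To see how the three flags combine, the field lists above can be written out as plain Python. This sketch is derived only from those lists and does not call deeplake; `BASE_FIELDS`, `OPTIONAL_FIELDS`, and `expected_coco_columns` are hypothetical names, and the column order in a real dataset is not assumed to match.

```python
BASE_FIELDS = [
    "id", "image", "url", "year", "version", "description", "contributor",
    "date_created", "date_captured", "embedding", "license", "is_crowd",
]

# Fields added by each flag, as listed in the description above.
OPTIONAL_FIELDS = {
    "objects": ["objects_bbox", "objects_classes"],
    "keypoints": [
        "keypoints_bbox", "keypoints_classes", "keypoints", "keypoints_skeleton",
    ],
    "stuffs": ["stuffs_bbox", "stuffs_classes"],
}


def expected_coco_columns(objects=True, keypoints=False, stuffs=False):
    """List the column names the documented field lists imply for these flags."""
    flags = {"objects": objects, "keypoints": keypoints, "stuffs": stuffs}
    columns = list(BASE_FIELDS)
    for group, enabled in flags.items():
        if enabled:
            columns.extend(OPTIONAL_FIELDS[group])
    return columns


print(len(expected_coco_columns()))
print(expected_coco_columns(objects=False, stuffs=True)[-2:])
```

A list like this can be compared as a set against the column names of a dataset created from the template to confirm that the flags had the intended effect.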

Parameters:

Name	Type	Description	Default
`embedding_size`	`int`	Size of the embeddings.	required
`quantize`	`bool`	If true, quantize the embeddings, slightly decreasing accuracy while greatly increasing query speed.	`False`
`objects`	`bool`	Whether to include object-related fields.	`True`
`keypoints`	`bool`	Whether to include keypoint-related fields.	`False`
`stuffs`	`bool`	Whether to include stuff-related fields.	`False`

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768))

Customize the schema before creating the dataset:

import deeplake
from deeplake import types

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768, objects=True, keypoints=True)
    .rename("embedding", "image_embed")
    .add("author", types.Text()))

Add a new field to the schema:

import deeplake
from deeplake import types

schema = deeplake.schemas.COCOImages(768)
schema.add("location", types.Text())
ds = deeplake.create("tmp://", schema=schema)

SchemaTemplate

A template that can be used for creating a new dataset with deeplake.create.

This class allows you to define and customize the schema for your dataset.

Parameters:

Name	Type	Description	Default
`schema`	`dict[str, DataType \| str \| Type]`	A dictionary where the key is the column name and the value is the data type.	required

Methods:

Name	Description
`add`	`add(name: str, dtype: DataType \| str \| Type) -> SchemaTemplate`: Adds a new column to the template.
`remove`	`remove(name: str) -> SchemaTemplate`: Removes a column from the template.
`rename`	`rename(old_name: str, new_name: str) -> SchemaTemplate`: Renames a column in the template.

Examples:

Create a new schema template, modify it, and create a dataset with the schema:

import deeplake
from deeplake import types

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.add("author", types.Text())
schema.remove("text")
schema.rename("embedding", "text_embedding")
ds = deeplake.create("tmp://", schema=schema)
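Since `add`, `remove`, and `rename` each return a SchemaTemplate, calls can be chained, as the COCOImages and TextEmbeddings examples do. The toy class below illustrates that pattern and the column set the example above ends up with. It is not deeplake's implementation: `TemplateSketch` is a hypothetical name, dtypes are plain strings, and the errors it raises are this sketch's own choices, since the reference does not specify error behaviour.

```python
class TemplateSketch:
    """Toy model of a schema template: a mapping of column name to dtype."""

    def __init__(self, schema):
        self._columns = dict(schema)

    def add(self, name, dtype):
        if name in self._columns:
            raise ValueError(f"column {name!r} already exists")  # sketch's choice
        self._columns[name] = dtype
        return self  # returning the template is what enables chaining

    def remove(self, name):
        del self._columns[name]  # KeyError for unknown names (sketch's choice)
        return self

    def rename(self, old_name, new_name):
        if old_name not in self._columns:
            raise KeyError(old_name)  # sketch's choice
        self._columns = {
            (new_name if key == old_name else key): value
            for key, value in self._columns.items()
        }
        return self

    @property
    def names(self):
        return list(self._columns)


template = (
    TemplateSketch({"id": "uint64", "text": "text", "embedding": "embedding(768)"})
    .add("author", "text")
    .remove("text")
    .rename("embedding", "text_embedding")
)
print(sorted(template.names))
```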

__init__

__init__(schema: dict[str, DataType | str | Type]) -> None

Constructs a new SchemaTemplate from the given dict.

add

add(
    name: str, dtype: DataType | str | Type
) -> SchemaTemplate

Adds a new column to the template.

Parameters:

Name	Type	Description	Default
`name`	`str`	The column name.	required
`dtype`	`DataType \| str \| Type`	The column data type.	required

Returns:

Name	Type	Description
`SchemaTemplate`	`SchemaTemplate`	The updated schema template.

Examples:

Add a new column to the schema:

import deeplake
from deeplake import types

schema = deeplake.schemas.SchemaTemplate({})
schema.add("author", types.Text())

remove

remove(name: str) -> SchemaTemplate

Removes a column from the template.

Parameters:

Name	Type	Description	Default
`name`	`str`	The column name.	required

Returns:

Name	Type	Description
`SchemaTemplate`	`SchemaTemplate`	The updated schema template.

Examples:

Remove a column from the schema:

import deeplake
from deeplake import types

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.remove("text")

rename

rename(old_name: str, new_name: str) -> SchemaTemplate

Renames a column in the template.

Parameters:

Name	Type	Description	Default
`old_name`	`str`	Existing column name.	required
`new_name`	`str`	New column name.	required

Returns:

Name	Type	Description
`SchemaTemplate`	`SchemaTemplate`	The updated schema template.

Examples:

Rename a column in the schema:

import deeplake
from deeplake import types

schema = deeplake.schemas.SchemaTemplate({
    "id": types.UInt64(),
    "text": types.Text(),
    "embedding": types.Embedding(768)
})
schema.rename("embedding", "text_embedding")

TextEmbeddings

TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> SchemaTemplate

A schema for storing embedded text from documents.

This schema includes the following fields:

- id (uint64): Unique identifier for each entry.
- chunk_index (uint16): Position of the text chunk within the document.
- document_id (uint64): Unique identifier for the document the embedding came from.
- date_created (uint64): Timestamp when the document was read.
- text_chunk (text): The text of the shard.
- embedding (dtype=float32, size=embedding_size): The embedding of the text.
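The field list above implies one row per text chunk. As a rough sketch of how a document might be split into rows with those field names, the helper below slices text into fixed-size chunks. Everything beyond the field names is an assumption: `build_rows` and `placeholder_embed` are hypothetical, the fixed-width chunking is only one possible strategy, `date_created` is taken to be seconds since the Unix epoch, and the placeholder embedding stands in for a real embedding model.

```python
import time


def build_rows(document_id, text, embed, chunk_size=200, start_id=0):
    """Split `text` into chunks and return one row dict per chunk."""
    read_at = int(time.time())  # assumption: timestamp in whole seconds
    rows = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunk = text[start:start + chunk_size]
        rows.append({
            "id": start_id + index,
            "chunk_index": index,  # documented as uint16, so at most 65535
            "document_id": document_id,
            "date_created": read_at,
            "text_chunk": chunk,
            "embedding": embed(chunk),
        })
    return rows


def placeholder_embed(chunk, size=8):
    """Hypothetical stand-in for an embedding model; the vector is meaningless."""
    return [float(len(chunk))] * size


rows = build_rows(document_id=7, text="a" * 450, embed=placeholder_embed)
print(len(rows), [row["chunk_index"] for row in rows])
```

The length of each vector returned by `embed` would need to equal the `embedding_size` passed to TextEmbeddings.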

Parameters:

Name	Type	Description	Default
`embedding_size`	`int`	Size of the embeddings.	required
`quantize`	`bool`	If true, quantize the embeddings, slightly decreasing accuracy while greatly increasing query speed.	`False`

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768))

Customize the schema before creating the dataset:

import deeplake
from deeplake import types

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768)
    .rename("embedding", "text_embed")
    .add("author", types.Text()))

Add a new field to the schema:

import deeplake
from deeplake import types

schema = deeplake.schemas.TextEmbeddings(768)
schema.add("language", types.Text())
ds = deeplake.create("tmp://", schema=schema)

Storage Formats

deeplake.formats.DataFormat

Base class for all datafile formats.

deeplake.formats.Chunk

Chunk(
    sample_compression: str | None = None,
    chunk_compression: str | None = None,
) -> DataFormat

Configures a "chunk" datafile format

Parameters:

Name	Type	Description	Default
`sample_compression`	`str \| None`	How to compress individual values within the datafile.	`None`
`chunk_compression`	`str \| None`	How to compress the datafile as a whole.	`None`
