Schemas

Deep Lake provides pre-built schema templates for common data structures.

Schema Templates

deeplake.schemas.SchemaTemplate

A template that can be used for creating a new dataset with deeplake.create.

This class allows you to define and customize the schema for your dataset.

Parameters:

  • schema (dict[str, DataType | str | Type], required): A dictionary where the key is the column name and the value is the data type.

Methods:

  • add(name: str, dtype: DataType | str | Type) -> SchemaTemplate: Adds a new column to the template.
  • remove(name: str) -> SchemaTemplate: Removes a column from the template.
  • rename(old_name: str, new_name: str) -> SchemaTemplate: Renames a column in the template.

Examples:

Create a new schema template, modify it, and create a dataset with the schema:

import deeplake

schema = deeplake.schemas.SchemaTemplate({
    "id": deeplake.types.UInt64(),
    "text": deeplake.types.Text(),
    "embedding": deeplake.types.Embedding(768)
})
schema.add("author", deeplake.types.Text())
schema.remove("text")
schema.rename("embedding", "text_embedding")
ds = deeplake.create("tmp://", schema=schema)

__init__

__init__(schema: dict[str, DataType | str | Type]) -> None

Constructs a new SchemaTemplate from the given dict.
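
For instance, a minimal sketch (the string dtype "text" here is an assumption based on the str alternative in the signature):

import deeplake

schema = deeplake.schemas.SchemaTemplate({
    "id": deeplake.types.UInt64(),
    "caption": "text",  # assumed: plain string dtype names are accepted
})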

add

add(
    name: str, dtype: DataType | str | Type
) -> SchemaTemplate

Adds a new column to the template.

Parameters:

  • name (str, required): The column name.
  • dtype (DataType | str | Type, required): The column data type.

Returns:

  • SchemaTemplate: The updated schema template.

Examples:

Add a new column to the schema:

schema = deeplake.schemas.SchemaTemplate({})
schema.add("author", deeplake.types.Text())

remove

remove(name: str) -> SchemaTemplate

Removes a column from the template.

Parameters:

  • name (str, required): The column name.

Returns:

  • SchemaTemplate: The updated schema template.

Examples:

Remove a column from the schema:

schema = deeplake.schemas.SchemaTemplate({
    "id": deeplake.types.UInt64(),
    "text": deeplake.types.Text(),
    "embedding": deeplake.types.Embedding(768)
})
schema.remove("text")

rename

rename(old_name: str, new_name: str) -> SchemaTemplate

Renames a column in the template.

Parameters:

  • old_name (str, required): Existing column name.
  • new_name (str, required): New column name.

Returns:

  • SchemaTemplate: The updated schema template.

Examples:

Rename a column in the schema:

schema = deeplake.schemas.SchemaTemplate({
    "id": deeplake.types.UInt64(),
    "text": deeplake.types.Text(),
    "embedding": deeplake.types.Embedding(768)
})
schema.rename("embedding", "text_embedding")

Text Embeddings Schema

deeplake.schemas.TextEmbeddings

TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> SchemaTemplate

A schema for storing embedded text from documents.

This schema includes the following fields:

  • id (uint64): Unique identifier for each entry.
  • chunk_index (uint16): Position of the text chunk within the document.
  • document_id (uint64): Unique identifier for the document the embedding came from.
  • date_created (uint64): Timestamp when the document was read.
  • text_chunk (text): The text of the chunk.
  • embedding (dtype=float32, size=embedding_size): The embedding of the text.

Parameters:

  • embedding_size (int, required): Size of the embeddings.
  • quantize (bool, optional): If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. Default is False.

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768))
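
Create a dataset with quantized embeddings for faster queries at a small accuracy cost:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768, quantize=True))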

Customize the schema before creating the dataset:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768)
    .rename("embedding", "text_embed")
    .add("author", types.Text()))

Add a new field to the schema:

schema = deeplake.schemas.TextEmbeddings(768)
schema.add("language", types.Text())
ds = deeplake.create("tmp://", schema=schema)
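
As a rough sketch, rows matching the fields above could then be appended (the column-oriented ds.append call is an assumption about the client API; adapt to your version):

import numpy as np

ds.append({
    "id": [1],
    "chunk_index": [0],
    "document_id": [42],
    "date_created": [1700000000],
    "text_chunk": ["Deep Lake stores embedded text."],
    "embedding": [np.random.rand(768).astype(np.float32)],  # assumed float32 payload
})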


COCO Images Schema

deeplake.schemas.COCOImages

COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> SchemaTemplate

A schema for storing COCO-based image data.

This schema includes the following fields:

  • id (uint64): Unique identifier for each entry.
  • image (jpg image): The image data.
  • url (text): URL of the image.
  • year (uint8): Year the image was captured.
  • version (text): Version of the dataset.
  • description (text): Description of the image.
  • contributor (text): Contributor of the image.
  • date_created (uint64): Timestamp when the image was created.
  • date_captured (uint64): Timestamp when the image was captured.
  • embedding (embedding): Embedding of the image.
  • license (text): License information.
  • is_crowd (bool): Whether the image contains a crowd.

If objects is true, the following fields are added:

  • objects_bbox (bounding box): Bounding boxes for objects.
  • objects_classes (segment mask): Segment masks for objects.

If keypoints is true, the following fields are added:

  • keypoints_bbox (bounding box): Bounding boxes for keypoints.
  • keypoints_classes (segment mask): Segment masks for keypoints.
  • keypoints (2-dimensional array of uint32): Keypoints data.
  • keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.

If stuffs is true, the following fields are added:

  • stuffs_bbox (bounding box): Bounding boxes for stuffs.
  • stuffs_classes (segment mask): Segment masks for stuffs.

Parameters:

  • embedding_size (int, required): Size of the embeddings.
  • quantize (bool, optional): If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. Default is False.
  • objects (bool, optional): Whether to include object-related fields. Default is True.
  • keypoints (bool, optional): Whether to include keypoint-related fields. Default is False.
  • stuffs (bool, optional): Whether to include stuff-related fields. Default is False.

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768))

Customize the schema before creating the dataset:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768, objects=True, keypoints=True)
    .rename("embedding", "image_embed")
    .add("author", types.Text()))

Add a new field to the schema:

schema = deeplake.schemas.COCOImages(768)
schema.add("location", types.Text())
ds = deeplake.create("tmp://", schema=schema)
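
Enable every optional field group from the lists above:

schema = deeplake.schemas.COCOImages(768, objects=True, keypoints=True, stuffs=True)
ds = deeplake.create("tmp://", schema=schema)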


Custom Schema Template

Create custom schema templates:

import deeplake

# Define custom schema
schema = deeplake.schemas.SchemaTemplate({
    "id": deeplake.types.UInt64(),
    "image": deeplake.types.Image(),
    "embedding": deeplake.types.Embedding(512),
    "metadata": deeplake.types.Dict()
})

# Modify the template before creating the dataset
schema.add("timestamp", deeplake.types.UInt64())
schema.remove("metadata")
schema.rename("embedding", "image_embedding")

# Create dataset with the customized schema
ds = deeplake.create("s3://bucket/dataset", schema=schema)

Creating datasets from predefined data formats

from_coco

Deep Lake provides a pre-built function to translate COCO format datasets into Deep Lake format.

Key Features

  • Multiple Annotation Support: Handles instances, keypoints, and stuff annotations
  • Flexible Storage: Works with both cloud and local storage
  • Data Preservation:
    • Converts segmentation to binary masks
    • Preserves category hierarchies
    • Maintains COCO metadata
  • Development Features:
    • Progress tracking during ingestion
    • Configurable tensor and group mappings

Basic Usage

The basic flow for COCO ingestion is shown below:

import deeplake

# Placeholder paths to your local COCO images and annotation files
images_directory = "path/to/coco/images"
instances_annotation = "path/to/annotations/instances.json"
keypoints_annotation = "path/to/annotations/keypoints.json"
stuff_annotation = "path/to/annotations/stuff.json"

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={
        "instances": instances_annotation,
        "keypoints": keypoints_annotation,
        "stuff": stuff_annotation,
    },
    dest="al://<your_org_id>/<desired_dataset_name>"
)

Advanced Configuration

You can customize group names and column mappings using file_to_group_mapping and key_to_column_mapping:

import deeplake

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={
        "instances": instances_annotation,
        "keypoints": keypoints_annotation,
    },
    dest="al://<your_org_id>/<desired_dataset_name>",
    file_to_group_mapping={
        "instances": "new_instances",
        "keypoints": "new_keypoints1",
    }
)
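
The example above only exercises file_to_group_mapping; a sketch combining it with key_to_column_mapping follows (the specific COCO keys and target column names are illustrative assumptions, not documented defaults):

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={"instances": instances_annotation},
    dest="al://<your_org_id>/<desired_dataset_name>",
    file_to_group_mapping={"instances": "detections"},
    key_to_column_mapping={
        # Rename COCO annotation keys to custom column names (illustrative)
        "bbox": "boxes",
        "category_id": "labels",
    },
)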

Important Notes

  • All segmentation polygons and RLEs are converted to stacked binary masks
  • The code assumes all annotation files share the same image IDs
  • Supported storage options:
    • Deep Lake cloud storage
    • S3
    • Azure Blob Storage
    • Google Cloud Storage
    • Local File System
  • Provides progress tracking through optional tqdm integration