
Schemas

Deep Lake provides pre-built schema templates for common data structures.

Schema Templates

Schema templates are Python dictionaries that define the structure of the dataset. Each schema template is a dictionary with field names as keys and field types as values.
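Because each template is returned as a plain mapping of field names to types, you can inspect and adjust it before passing it to deeplake.create. A minimal sketch, assuming the template supports the standard dict interface as the signatures below indicate:

import deeplake

# Build a template and look at the fields it defines
schema = deeplake.schemas.TextEmbeddings(embedding_size=768)
print(schema)

# Adjust it like any other dict before creating the dataset
schema["language"] = deeplake.types.Text()
ds = deeplake.create("tmp://", schema=schema)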

Text Embeddings Schema

deeplake.schemas.TextEmbeddings

TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> dict[str, DataType | str | Type]

A schema for storing embedded text from documents.

This schema includes the following fields:

  • id (uint64): Unique identifier for each entry.
  • chunk_index (uint16): Position of the text chunk within the document.
  • document_id (uint64): Unique identifier for the document the embedding came from.
  • date_created (uint64): Timestamp when the document was read.
  • text_chunk (text): The text of the chunk.
  • embedding (dtype=float32, size=embedding_size): The embedding of the text.
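A hedged sketch of writing one row that matches these fields, assuming the column-oriented ds.append({...}) API that accepts a list of values per column:

import deeplake
import numpy as np

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768))

# One row per list element; the embedding length must match embedding_size (768 here)
ds.append({
    "id": [1],
    "chunk_index": [0],
    "document_id": [42],
    "date_created": [1715000000],
    "text_chunk": ["Deep Lake stores each text chunk alongside its embedding."],
    "embedding": [np.random.rand(768).astype(np.float32)],
})
ds.commit()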

Parameters:

  • embedding_size (int, required): Size of the embeddings.
  • quantize (bool, optional, default False): If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed.
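If query speed matters more than exact accuracy, the template can be built with quantization enabled. A minimal sketch of the quantized variant:

import deeplake

# Quantized embeddings: slightly lower accuracy, much faster vector queries
schema = deeplake.schemas.TextEmbeddings(embedding_size=768, quantize=True)
ds = deeplake.create("tmp://", schema=schema)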

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.TextEmbeddings(768))

Customize the schema before creating the dataset:

schema = deeplake.schemas.TextEmbeddings(768)
schema["text_embed"] = schema.pop("embedding")
schema["author"] = types.Text()
ds = deeplake.create("tmp://", schema=schema)

Add a new field to the schema:

schema = deeplake.schemas.TextEmbeddings(768)
schema["language"] = types.Text()
ds = deeplake.create("tmp://", schema=schema)

# Create dataset with text embeddings schema
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.TextEmbeddings(768))

# Customize before creation
schema = deeplake.schemas.TextEmbeddings(768)
schema["text_embedding"] = schema.pop("embedding")
schema["source"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)

# Add field to existing schema
schema = deeplake.schemas.TextEmbeddings(768)
schema["language"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)

COCO Images Schema

deeplake.schemas.COCOImages

COCOImages(
    embedding_size: int,
    quantize: bool = False,
    objects: bool = True,
    keypoints: bool = False,
    stuffs: bool = False,
) -> dict[str, DataType | str | Type]

A schema for storing COCO-based image data.

This schema includes the following fields:

  • id (uint64): Unique identifier for each entry.
  • image (jpg image): The image data.
  • url (text): URL of the image.
  • year (uint8): Year the image was captured.
  • version (text): Version of the dataset.
  • description (text): Description of the image.
  • contributor (text): Contributor of the image.
  • date_created (uint64): Timestamp when the image was created.
  • date_captured (uint64): Timestamp when the image was captured.
  • embedding (embedding): Embedding of the image.
  • license (text): License information.
  • is_crowd (bool): Whether the image contains a crowd.

If objects is true, the following fields are added:

  • objects_bbox (bounding box): Bounding boxes for objects.
  • objects_classes (segment mask): Segment masks for objects.

If keypoints is true, the following fields are added:

  • keypoints_bbox (bounding box): Bounding boxes for keypoints.
  • keypoints_classes (segment mask): Segment masks for keypoints.
  • keypoints (2-dimensional array of uint32): Keypoints data.
  • keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.

If stuffs is true, the following fields are added:

  • stuffs_bbox (bounding box): Bounding boxes for stuffs.
  • stuffs_classes (segment mask): Segment masks for stuffs.
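Because the template is a mapping, you can check which optional fields a given configuration adds before creating the dataset. A small sketch, assuming the returned mapping supports the standard dict interface:

import deeplake

base = deeplake.schemas.COCOImages(embedding_size=768)
full = deeplake.schemas.COCOImages(embedding_size=768, keypoints=True, stuffs=True)

# Fields present only because keypoints and stuffs were enabled
print(set(full.keys()) - set(base.keys()))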

Parameters:

  • embedding_size (int, required): Size of the embeddings.
  • quantize (bool, optional, default False): If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed.
  • objects (bool, optional, default True): Whether to include object-related fields.
  • keypoints (bool, optional, default False): Whether to include keypoint-related fields.
  • stuffs (bool, optional, default False): Whether to include stuff-related fields.

Examples:

Create a dataset with the standard schema:

ds = deeplake.create("tmp://", schema=deeplake.schemas.COCOImages(768))

Customize the schema before creating the dataset:

schema = deeplake.schemas.COCOImages(768, objects=True, keypoints=True)
schema["image_embed"] = schema.pop("embedding")
schema["author"] = types.Text()
ds = deeplake.create("tmp://", schema=schema)

Add a new field to the schema:

schema = deeplake.schemas.COCOImages(768)
schema["location"] = types.Text()
ds = deeplake.create("tmp://", schema=schema)

# Basic COCO dataset
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(768))

# With keypoints and object detection
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(
        embedding_size=768,
        keypoints=True,
        objects=True
    ))

# Customize schema
schema = deeplake.schemas.COCOImages(768)
schema["raw_image"] = schema.pop("image")
schema["camera_id"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)

Custom Schema Template

Create custom schema templates:

# Define custom schema
schema = {
    "id": deeplake.types.UInt64(),
    "image": deeplake.types.Image(),
    "embedding": deeplake.types.Embedding(512),
    "metadata": deeplake.types.Dict()
}

# Create dataset with custom schema
ds = deeplake.create("s3://bucket/dataset", schema=schema)

# Modify the schema template (applies to datasets created from it afterwards)
schema["timestamp"] = deeplake.types.UInt64()
schema.pop("metadata")
schema["image_embedding"] = schema.pop("embedding")

Creating datasets from predefined data formats

from_coco

Deep Lake provides a pre-built function to translate COCO format datasets into Deep Lake format.

Key Features

  • Multiple Annotation Support: Handles instances, keypoints, and stuff annotations
  • Flexible Storage: Works with both cloud and local storage
  • Data Preservation:
    • Converts segmentation to binary masks
    • Preserves category hierarchies
    • Maintains COCO metadata
  • Development Features:
    • Progress tracking during ingestion
    • Configurable tensor and group mappings

Basic Usage

The basic flow for COCO ingestion is shown below:

import deeplake

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={
        "instances": instances_annotation,
        "keypoints": keypoints_annotation,
        "stuff": stuff_annotation,
    },
    dest="al://<your_org_id>/<desired_dataset_name>"
)

Advanced Configuration

You can customize group names and column mappings using file_to_group_mapping and key_to_column_mapping:

import deeplake

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={
        "instances": instances_annotation,
        "keypoints": keypoints_annotation,
    },
    dest="al://<your_org_id>/<desired_dataset_name>",
    file_to_group_mapping={
        "instances": "new_instances",
        "keypoints": "new_keypoints1",
    }
)
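key_to_column_mapping renames individual annotation keys inside each group. The keys mapped below ("bbox", "category_id") are standard COCO annotation keys used here for illustration; treat the exact mapping as an assumption and adjust it to your annotation files:

import deeplake

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={"instances": instances_annotation},
    dest="al://<your_org_id>/<desired_dataset_name>",
    file_to_group_mapping={"instances": "detections"},
    # Rename annotation keys to the column names you want inside the group
    key_to_column_mapping={
        "bbox": "boxes",
        "category_id": "labels",
    },
)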

Important Notes

  • All segmentation polygons and RLEs are converted to stacked binary masks
  • The code assumes all annotation files share the same image IDs
  • Supported storage options:
    • Deep Lake cloud storage
    • S3
    • Azure Blob Storage
    • Google Cloud Storage
    • Local File System
  • Provides progress tracking through optional tqdm integration
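
After ingestion, a quick way to confirm what was created is to print the dataset's size and summary; a minimal sketch, assuming ds.summary() is available on the returned dataset:

# Spot-check the ingested dataset: row count plus column names and types
print(len(ds))
ds.summary()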