Schemas¶
Deep Lake provides pre-built schema templates for common data structures.
Schema Templates¶
Schema templates are Python dictionaries that define the structure of a dataset, with field names as keys and field types as values.
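Because a template is an ordinary dictionary, it can be inspected and edited before being passed to deeplake.create. A minimal sketch, assuming the template returned by TextEmbeddings behaves like a standard dict as described above:

import deeplake

# Build a template and look at the fields it defines
schema = deeplake.schemas.TextEmbeddings(embedding_size=768)
for name, dtype in schema.items():
    print(name, dtype)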
Text Embeddings Schema¶
deeplake.schemas.TextEmbeddings¶
TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> dict[str, DataType | str | Type]
A schema for storing embedded text from documents.
This schema includes the following fields:
- id (uint64): Unique identifier for each entry.
- chunk_index (uint16): Position of the text chunk within the document.
- document_id (uint64): Unique identifier for the document the embedding came from.
- date_created (uint64): Timestamp when the document was read.
- text_chunk (text): The text of the chunk.
- embedding (dtype=float32, size=embedding_size): The embedding of the text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_size | int | Size of the embeddings. | required |
quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. | False |
Examples:

Create a dataset with the standard schema:

# Create dataset with text embeddings schema
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.TextEmbeddings(768))

Customize the schema before creating the dataset:

# Customize before creation
schema = deeplake.schemas.TextEmbeddings(768)
schema["text_embedding"] = schema.pop("embedding")
schema["source"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)

Add a new field to the schema:

# Add field to existing schema
schema = deeplake.schemas.TextEmbeddings(768)
schema["language"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)
COCO Images Schema¶
deeplake.schemas.COCOImages¶
COCOImages(
embedding_size: int,
quantize: bool = False,
objects: bool = True,
keypoints: bool = False,
stuffs: bool = False,
) -> dict[str, DataType | str | Type]
A schema for storing COCO-based image data.
This schema includes the following fields:
- id (uint64): Unique identifier for each entry.
- image (jpg image): The image data.
- url (text): URL of the image.
- year (uint8): Year the image was captured.
- version (text): Version of the dataset.
- description (text): Description of the image.
- contributor (text): Contributor of the image.
- date_created (uint64): Timestamp when the image was created.
- date_captured (uint64): Timestamp when the image was captured.
- embedding (embedding): Embedding of the image.
- license (text): License information.
- is_crowd (bool): Whether the image contains a crowd.
If objects is true, the following fields are added:
- objects_bbox (bounding box): Bounding boxes for objects.
- objects_classes (segment mask): Segment masks for objects.

If keypoints is true, the following fields are added:
- keypoints_bbox (bounding box): Bounding boxes for keypoints.
- keypoints_classes (segment mask): Segment masks for keypoints.
- keypoints (2-dimensional array of uint32): Keypoints data.
- keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.

If stuffs is true, the following fields are added:
- stuffs_bbox (bounding box): Bounding boxes for stuffs.
- stuffs_classes (segment mask): Segment masks for stuffs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_size | int | Size of the embeddings. | required |
quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. | False |
objects | bool | Whether to include object-related fields. | True |
keypoints | bool | Whether to include keypoint-related fields. | False |
stuffs | bool | Whether to include stuff-related fields. | False |
Examples:

Create a dataset with the standard schema:

# Basic COCO dataset
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(768))

# With keypoints and object detection
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(
        embedding_size=768,
        keypoints=True,
        objects=True
    ))

Customize the schema before creating the dataset:

schema = deeplake.schemas.COCOImages(768, objects=True, keypoints=True)
schema["image_embed"] = schema.pop("embedding")
schema["author"] = deeplake.types.Text()
ds = deeplake.create("tmp://", schema=schema)

Add a new field to the schema:

# Customize schema
schema = deeplake.schemas.COCOImages(768)
schema["raw_image"] = schema.pop("image")
schema["camera_id"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)
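Because the template is a plain dictionary, fields you do not plan to populate can also be removed before the dataset is created. A small sketch; the dropped field names come from the list above, and which ones you remove is up to you:

# Trim metadata fields that will not be filled
schema = deeplake.schemas.COCOImages(768)
for unused in ("url", "description", "contributor"):
    schema.pop(unused, None)
ds = deeplake.create("tmp://", schema=schema)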
Custom Schema Template¶
Create custom schema templates:
# Define custom schema
schema = {
"id": deeplake.types.UInt64(),
"image": deeplake.types.Image(),
"embedding": deeplake.types.Embedding(512),
"metadata": deeplake.types.Dict()
}
# Create dataset with custom schema
ds = deeplake.create("s3://bucket/dataset", schema=schema)
# Modify the schema; changes apply only to datasets created with it afterwards
schema["timestamp"] = deeplake.types.UInt64()
schema.pop("metadata")
schema["image_embedding"] = schema.pop("embedding")
Creating datasets from predefined data formats¶
from_coco¶
Deep Lake provides a pre-built function to translate COCO format datasets into Deep Lake format.
Key Features¶
- Multiple Annotation Support: Handles instances, keypoints, and stuff annotations
- Flexible Storage: Works with both cloud and local storage
- Data Preservation:
  - Converts segmentation to binary masks
  - Preserves category hierarchies
  - Maintains COCO metadata
- Development Features:
  - Progress tracking during ingestion
  - Configurable tensor and group mappings
Basic Usage¶
The basic flow for COCO ingestion is shown below:
import deeplake

# images_directory and the *_annotation variables below are placeholder paths
# to your local COCO image folder and annotation JSON files
ds = deeplake.from_coco(
images_directory=images_directory,
annotation_files={
"instances": instances_annotation,
"keypoints": keypoints_annotation,
"stuff": stuff_annotation,
},
dest="al://<your_org_id>/<desired_dataset_name>"
)
Advanced Configuration¶
You can customize group names and column mappings using file_to_group_mapping and key_to_column_mapping:
import deeplake
ds = deeplake.from_coco(
images_directory=images_directory,
annotation_files={
"instances": instances_annotation,
"keypoints": keypoints_annotation,
},
dest="al://<your_org_id>/<desired_dataset_name>",
file_to_group_mapping={
"instances": "new_instances",
"keypoints": "new_keypoints1",
}
)
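key_to_column_mapping works analogously for the individual annotation keys within a file. The sketch below is illustrative only; the key names ("bbox", "category_id") are assumptions based on standard COCO annotation fields, not a verified or exhaustive list:

import deeplake

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={"instances": instances_annotation},
    dest="al://<your_org_id>/<desired_dataset_name>",
    file_to_group_mapping={"instances": "detections"},
    # Hypothetical renaming of COCO annotation keys to custom column names
    key_to_column_mapping={
        "bbox": "boxes",
        "category_id": "categories",
    },
)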
Important Notes¶
- All segmentation polygons and RLEs are converted to stacked binary masks
- The code assumes all annotation files share the same image IDs
- Supported storage options:
  - Deep Lake cloud storage
  - S3
  - Azure Blob Storage
  - Google Cloud Storage
  - Local file system (see the sketch after this list)
- Provides progress tracking through optional tqdm integration
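For instance, the same ingestion can target the local file system instead of Deep Lake cloud storage. A minimal sketch with placeholder paths:

import deeplake

# Ingest into a dataset stored on the local file system
ds = deeplake.from_coco(
    images_directory="./coco/train2017",
    annotation_files={"instances": "./coco/annotations/instances_train2017.json"},
    dest="./coco_deeplake",
)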