Schemas¶
Deep Lake provides pre-built schema templates for common data structures.
Schema Templates¶
Schema templates are Python dictionaries that define the structure of a dataset, with field names as keys and field types as values.
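Because a template is an ordinary dictionary, it can be inspected and edited before being passed to deeplake.create. A minimal sketch, assuming the template returned by TextEmbeddings behaves like a standard dict as described above:

import deeplake

# Build a template and look at the fields it defines
schema = deeplake.schemas.TextEmbeddings(embedding_size=768)
for name, dtype in schema.items():
    print(name, dtype)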
Text Embeddings Schema¶
deeplake.schemas.TextEmbeddings¶
TextEmbeddings(
    embedding_size: int, quantize: bool = False
) -> dict[str, DataType | str | Type]
A schema for storing embedded text from documents.
This schema includes the following fields:
- id (uint64): Unique identifier for each entry.
- chunk_index (uint16): Position of the text chunk within the document.
- document_id (uint64): Unique identifier for the document the embedding came from.
- date_created (uint64): Timestamp when the document was read.
- text_chunk (text): The text of the chunk.
- embedding (dtype=float32, size=embedding_size): The embedding of the text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_size | int | Size of the embeddings. | required |
quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. | False |
Examples:

Create a dataset with the standard schema:

# Create dataset with text embeddings schema
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.TextEmbeddings(768))

Customize the schema before creating the dataset:

# Customize before creation
schema = deeplake.schemas.TextEmbeddings(768)
schema["text_embedding"] = schema.pop("embedding")
schema["source"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)

Add a new field to the schema:

# Add field to existing schema
schema = deeplake.schemas.TextEmbeddings(768)
schema["language"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)
COCO Images Schema¶
deeplake.schemas.COCOImages¶
COCOImages(
embedding_size: int,
quantize: bool = False,
objects: bool = True,
keypoints: bool = False,
stuffs: bool = False,
) -> dict[str, DataType | str | Type]
A schema for storing COCO-based image data.
This schema includes the following fields:
- id (uint64): Unique identifier for each entry.
- image (jpg image): The image data.
- url (text): URL of the image.
- year (uint8): Year the image was captured.
- version (text): Version of the dataset.
- description (text): Description of the image.
- contributor (text): Contributor of the image.
- date_created (uint64): Timestamp when the image was created.
- date_captured (uint64): Timestamp when the image was captured.
- embedding (embedding): Embedding of the image.
- license (text): License information.
- is_crowd (bool): Whether the image contains a crowd.
If objects is true, the following fields are added:
- objects_bbox (bounding box): Bounding boxes for objects.
- objects_classes (segment mask): Segment masks for objects.

If keypoints is true, the following fields are added:
- keypoints_bbox (bounding box): Bounding boxes for keypoints.
- keypoints_classes (segment mask): Segment masks for keypoints.
- keypoints (2-dimensional array of uint32): Keypoints data.
- keypoints_skeleton (2-dimensional array of uint16): Skeleton data for keypoints.

If stuffs is true, the following fields are added:
- stuffs_bbox (bounding box): Bounding boxes for stuffs.
- stuffs_classes (segment mask): Segment masks for stuffs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_size | int | Size of the embeddings. | required |
quantize | bool | If true, quantize the embeddings to slightly decrease accuracy while greatly increasing query speed. | False |
objects | bool | Whether to include object-related fields. | True |
keypoints | bool | Whether to include keypoint-related fields. | False |
stuffs | bool | Whether to include stuff-related fields. | False |
Examples:

Create a dataset with the standard schema:

# Basic COCO dataset
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(768))

# With keypoints and object detection
ds = deeplake.create("s3://bucket/dataset",
    schema=deeplake.schemas.COCOImages(
        embedding_size=768,
        keypoints=True,
        objects=True
    ))

Customize the schema before creating the dataset:

schema = deeplake.schemas.COCOImages(768, objects=True, keypoints=True)
schema["image_embed"] = schema.pop("embedding")
schema["author"] = deeplake.types.Text()
ds = deeplake.create("tmp://", schema=schema)

Add a new field to the schema:

# Customize schema
schema = deeplake.schemas.COCOImages(768)
schema["raw_image"] = schema.pop("image")
schema["camera_id"] = deeplake.types.Text()
ds = deeplake.create("s3://bucket/dataset", schema=schema)
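Because the template is a plain dictionary, fields you do not plan to populate can also be removed before the dataset is created. A small sketch; the dropped field names come from the list above, and which ones you remove is up to you:

# Trim metadata fields that will not be filled
schema = deeplake.schemas.COCOImages(768)
for unused in ("url", "description", "contributor"):
    schema.pop(unused, None)
ds = deeplake.create("tmp://", schema=schema)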
Custom Schema Template¶
Create custom schema templates:
# Define custom schema
schema = {
"id": deeplake.types.UInt64(),
"image": deeplake.types.Image(),
"embedding": deeplake.types.Embedding(512),
"metadata": deeplake.types.Dict()
}
# Create dataset with custom schema
ds = deeplake.create("s3://bucket/dataset", schema=schema)
# Modify the schema; changes apply only to datasets created with it afterwards
schema["timestamp"] = deeplake.types.UInt64()
schema.pop("metadata")
schema["image_embedding"] = schema.pop("embedding")
Creating datasets from predefined data formats¶
from_coco¶
Deep Lake provides a pre-built function to translate COCO format datasets into Deep Lake format.
Key Features¶
- Multiple Annotation Support: Handles instances, keypoints, and stuff annotations
- Flexible Storage: Works with both cloud and local storage
- Data Preservation:
  - Converts segmentation to binary masks
  - Preserves category hierarchies
  - Maintains COCO metadata
- Development Features:
  - Progress tracking during ingestion
  - Configurable tensor and group mappings
Basic Usage¶
The basic flow for COCO ingestion is shown below:
import deeplake

# images_directory and the *_annotation variables below are placeholder paths
# to your local COCO image folder and annotation JSON files
ds = deeplake.from_coco(
images_directory=images_directory,
annotation_files={
"instances": instances_annotation,
"keypoints": keypoints_annotation,
"stuff": stuff_annotation,
},
dest="al://<your_org_id>/<desired_dataset_name>"
)
Advanced Configuration¶
You can customize group names and column mappings using file_to_group_mapping and key_to_column_mapping:
import deeplake
ds = deeplake.from_coco(
images_directory=images_directory,
annotation_files={
"instances": instances_annotation,
"keypoints": keypoints_annotation,
},
dest="al://<your_org_id>/<desired_dataset_name>",
file_to_group_mapping={
"instances": "new_instances",
"keypoints": "new_keypoints1",
}
)
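key_to_column_mapping works analogously for the individual annotation keys within a file. The sketch below is illustrative only; the key names ("bbox", "category_id") are assumptions based on standard COCO annotation fields, not a verified or exhaustive list:

import deeplake

ds = deeplake.from_coco(
    images_directory=images_directory,
    annotation_files={"instances": instances_annotation},
    dest="al://<your_org_id>/<desired_dataset_name>",
    file_to_group_mapping={"instances": "detections"},
    # Hypothetical renaming of COCO annotation keys to custom column names
    key_to_column_mapping={
        "bbox": "boxes",
        "category_id": "categories",
    },
)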
Important Notes¶
- All segmentation polygons and RLEs are converted to stacked binary masks
- The code assumes all annotation files share the same image IDs
- Supported storage options:
  - Deep Lake cloud storage
  - S3
  - Azure Blob Storage
  - Google Cloud Storage
  - Local file system (see the sketch after this list)
- Provides progress tracking through optional tqdm integration
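For instance, the same ingestion can target the local file system instead of Deep Lake cloud storage. A minimal sketch with placeholder paths:

import deeplake

# Ingest into a dataset stored on the local file system
ds = deeplake.from_coco(
    images_directory="./coco/train2017",
    annotation_files={"instances": "./coco/annotations/instances_train2017.json"},
    dest="./coco_deeplake",
)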