Column Classes¶
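Deep Lake provides four column classes: two for accessing column data and two for working with column schema definitions, each in a read-write and a read-only variant: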
| Class | Description |
|---|---|
| Column | Full read-write access to column data |
| ColumnView | Read-only access to column data |
| ColumnDefinition | Schema definition for columns with modification capabilities |
| ColumnDefinitionView | Read-only schema definition for columns |
Column Class¶
deeplake.Column
¶
Bases: ColumnView
Provides read-write access to a column in a dataset. Column extends ColumnView with methods for modifying data, making it suitable for dataset creation and updates in ML workflows.
The Column class allows you to:
- Read and write data using integer indices, slices, or lists of indices
- Modify data asynchronously for better performance
- Access and modify column metadata
- Handle various data types common in ML: images, embeddings, labels, etc.
Examples:
Update training labels:
# Update single label
ds["labels"][0] = 1
# Update batch of labels
ds["labels"][0:32] = new_labels
# Async update for better performance
future = ds["labels"].set_async(slice(0, 32), new_labels)
future.wait()
Store image embeddings:
# Generate and store embeddings
embeddings = model.encode(images)
ds["embeddings"][0:len(embeddings)] = embeddings
Manage column metadata:
# Store preprocessing parameters
ds["images"].metadata["mean"] = [0.485, 0.456, 0.406]
ds["images"].metadata["std"] = [0.229, 0.224, 0.225]
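The relationship described above (Column derives from ColumnView and adds write methods) can be pictured with a small stand-in that needs no dataset. This is only a sketch of the pattern: `ViewSketch` and `ColumnSketch` are hypothetical names, a plain Python list plays the role of the stored rows, and none of this reflects how Deep Lake implements its classes.

```python
from typing import Any, List, Union


class ViewSketch:
    """Read-only access: indexing and name only (stand-in for ColumnView)."""

    def __init__(self, name: str, rows: List[Any]) -> None:
        self._name = name
        self._rows = rows

    @property
    def name(self) -> str:
        return self._name

    def __getitem__(self, index: Union[int, slice]) -> Any:
        return self._rows[index]


class ColumnSketch(ViewSketch):
    """Adds item assignment on top of the read-only interface."""

    def __setitem__(self, index: Union[int, slice], value: Any) -> None:
        self._rows[index] = value


rows = [0, 0, 0, 0]                    # shared storage for both objects
view = ViewSketch("labels", rows)
column = ColumnSketch("labels", rows)

column[0] = 1                          # single-row write
column[1:3] = [2, 3]                   # slice write

try:
    view[0] = 9                        # the view defines no __setitem__
    view_is_writable = True
except TypeError:
    view_is_writable = False

print(view[0:4], view_is_writable)
```

Because the subclass inherits every read method, code written against the read-only interface also accepts the read-write object, which is the practical consequence of `Bases: ColumnView`.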
__getitem__
¶
__getitem__(
    index: int | slice | list | tuple,
) -> ndarray | list | Dict | str | bytes | None | Array
Retrieve data from the column at the specified index or range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | `int \| slice \| list \| tuple` | Can be: an integer index, a slice, or a list or tuple of indices | required |
Returns:
| Type | Description |
|---|---|
| `ndarray \| list \| Dict \| str \| bytes \| None \| Array` | The data at the specified index/indices. Type depends on the column's data type. |
Examples:
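The accepted index forms (`int`, `slice`, `list`, `tuple`) can be illustrated without a dataset. The sketch below expands each form into explicit row positions. `resolve_index` is a hypothetical helper written for this illustration, and the handling of negative and out-of-range values follows ordinary Python list semantics, which is an assumption rather than documented Deep Lake behavior.

```python
from typing import List, Sequence, Union

Index = Union[int, slice, Sequence[int]]


def resolve_index(index: Index, length: int) -> List[int]:
    """Expand an index into explicit row positions (hypothetical helper)."""
    if isinstance(index, int):
        if not -length <= index < length:
            raise IndexError(f"row {index} out of range for {length} rows")
        return [index % length]        # maps -1 to the last row
    if isinstance(index, slice):
        # slice.indices clamps start/stop/step to the given length
        return list(range(*index.indices(length)))
    # list or tuple: resolve each entry on its own
    return [resolve_index(int(i), length)[0] for i in index]


rows = 100
single = resolve_index(0, rows)              # like column[0]
batch = resolve_index(slice(0, 32), rows)    # like column[0:32]
picked = resolve_index([1, 5, 10], rows)     # like column[[1, 5, 10]]

print(single, len(batch), picked)
```

With a real dataset the corresponding reads are `ds["images"][0]`, `ds["images"][0:32]`, and `ds["embeddings"][[1, 5, 10]]`.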
__setitem__
¶
Set data in the column at the specified index or range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | `int \| slice` | Can be: an integer index or a slice | required |
| value | `Any` | The data to store. Must match the column's data type. | required |
Examples:
create_index
¶
create_index(
    index_type: (
        str
        | IndexType
        | TextIndex
        | JsonIndex
        | NumericIndex
        | EmbeddingIndexType
        | EmbeddingsMatrixIndexType
        | Index
    ),
) -> None
Create an index on the column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index_type | `str \| IndexType \| TextIndex \| JsonIndex \| NumericIndex \| EmbeddingIndexType \| EmbeddingsMatrixIndexType \| Index` | Index type to create. Can be specified in three ways: with an IndexType enum value (recommended; the column type is detected automatically), e.g. deeplake.types.Inverted; with a string (the column type is detected automatically), e.g. "bm25"; or with a wrapped type (explicit type specification), e.g. deeplake.types.JsonIndex(deeplake.types.Inverted). | required |
Examples:
import deeplake
# Text column indexing with enum (recommended)
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["machine learning fundamentals"]})
ds["text_col"].create_index(deeplake.types.Inverted) # For CONTAINS queries
# Text column indexing with string
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["machine learning fundamentals"]})
ds["text_col"].create_index("bm25") # For BM25_SIMILARITY queries
# Numeric column indexing
ds = deeplake.create("tmp://")
ds.add_column("price", deeplake.types.Float32())
ds.append({"price": [29.99]})
ds["price"].create_index(deeplake.types.Inverted) # For numeric CONTAINS queries
# Dict (JSON) column indexing with wrapped type
ds = deeplake.create("tmp://")
ds.add_column("metadata", deeplake.types.Dict())
ds.append({"metadata": [{"category": "ml"}]})
ds["metadata"].create_index(deeplake.types.JsonIndex(deeplake.types.Inverted))
# 2D Array for embedding matrices (ColBERT-style retrieval)
ds = deeplake.create("tmp://")
ds.add_column("token_embeddings", deeplake.types.Array("float32", dimensions=2))
import numpy as np
ds.append({"token_embeddings": [np.random.rand(10, 128)]})
ds["token_embeddings"].create_index(deeplake.types.PooledQuantized) # For MAXSIM queries
drop_index
¶
drop_index(
    index_type: (
        str
        | IndexType
        | TextIndex
        | JsonIndex
        | NumericIndex
        | EmbeddingIndexType
        | EmbeddingsMatrixIndexType
        | Index
    ),
) -> None
Drop an index from the column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index_type | `str \| IndexType \| TextIndex \| JsonIndex \| NumericIndex \| EmbeddingIndexType \| EmbeddingsMatrixIndexType \| Index` | Index type to drop. Can be specified in three ways: with an IndexType enum value (recommended; the column type is detected automatically), e.g. deeplake.types.Inverted; with a string (the column type is detected automatically), e.g. "bm25"; or with a wrapped type (explicit type specification), e.g. deeplake.types.TextIndex(deeplake.types.Exact). | required |
Examples:
# Drop text index using enum (recommended)
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["sample text"]})
ds["text_col"].create_index(deeplake.types.Inverted)
ds["text_col"].drop_index(deeplake.types.Inverted)
# Drop text index using string
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["sample text"]})
ds["text_col"].create_index("bm25")
ds["text_col"].drop_index("bm25")
# Drop text index using wrapped type
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["sample text"]})
ds["text_col"].create_index(deeplake.types.TextIndex(deeplake.types.Exact))
ds["text_col"].drop_index(deeplake.types.TextIndex(deeplake.types.Exact))
# Drop numeric index using wrapped type
ds = deeplake.create("tmp://")
ds.add_column("price", deeplake.types.Float32())
ds.append({"price": [19.99]})
ds["price"].create_index(deeplake.types.NumericIndex(deeplake.types.Inverted))
ds["price"].drop_index(deeplake.types.NumericIndex(deeplake.types.Inverted))
# Drop 2D embedding matrix index
ds = deeplake.create("tmp://")
ds.add_column("token_embeddings", deeplake.types.Array("float32", dimensions=2))
import numpy as np
ds.append({"token_embeddings": [np.random.rand(10, 128)]})
ds["token_embeddings"].create_index(deeplake.types.PooledQuantized)
ds["token_embeddings"].drop_index(deeplake.types.PooledQuantized)
get_async
¶
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | `int \| slice \| list \| tuple` | Can be: an integer index, a slice, or a list or tuple of indices | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Future | `Future[Any]` | A Future object that resolves to the requested data. |
Examples:
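The benefit of an asynchronous read is that several requests can be issued before any result is needed. The sketch below shows that request-then-collect pattern with the standard library only: the `get_async` function here is a hypothetical stand-in built on `concurrent.futures`, a plain list plays the role of the column, and Deep Lake's own `Future` class is a separate type that merely exposes a similar `result()` call.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List

# Stand-in data: a plain list plays the role of a column.
column_data = list(range(1000))
executor = ThreadPoolExecutor(max_workers=4)


def get_async(index: slice) -> "Future[List[int]]":
    """Hypothetical stand-in for an asynchronous column read."""
    return executor.submit(lambda: column_data[index])


# Issue all requests first, so they can run concurrently ...
futures = [get_async(slice(start, start + 250)) for start in range(0, 1000, 250)]
# ... then block only when the data is actually needed.
batches = [f.result() for f in futures]
executor.shutdown()

print([len(b) for b in batches])
```

With a real dataset the equivalent call is `ds["images"].get_async(slice(0, 32))` followed by `.result()` on the returned future.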
name
property
¶
Get the name of the column.
Returns:
| Name | Type | Description |
|---|---|---|
| str | `str` | The column name. |
set_async
¶
Asynchronously set data in the column. Useful for large updates or when modifying multiple items in ML pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | `int \| slice` | Can be: an integer index or a slice | required |
| value | `Any` | The data to store. Must match the column's data type. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| FutureVoid | `FutureVoid` | A FutureVoid that completes when the update is finished. |
Examples:
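Large updates are usually split into fixed-size batches, each written with one slice. The sketch below generates those slices and applies them to a plain list that stands in for the column. `batch_slices` is a hypothetical helper, and the batch size of 32 is only an example value.

```python
from typing import Iterator


def batch_slices(total: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive slices that together cover rows 0..total-1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield slice(start, min(start + batch_size, total))


# Stand-in column: a list that is overwritten one batch at a time.
labels = [0] * 70
new_labels = [i % 2 for i in range(70)]

for s in batch_slices(len(labels), 32):
    # With a Deep Lake column this line would be:
    #   column.set_async(s, new_labels[s])
    labels[s] = new_labels[s]

spans = [(s.start, s.stop) for s in batch_slices(70, 32)]
print(spans)
```

When the writes are issued with `set_async`, the returned `FutureVoid` objects can be kept in a list and each one completed with `wait()` after the loop, as in the `future.wait()` call shown for the Column class.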
ColumnView Class¶
deeplake.ColumnView
¶
Provides read-only access to a column in a dataset. ColumnView is designed for efficient data access in ML workflows, supporting both synchronous and asynchronous operations.
The ColumnView class allows you to:
- Access column data using integer indices, slices, or lists of indices
- Retrieve data asynchronously for better performance in ML pipelines
- Access column metadata and properties
- Get information about linked data if the column contains references
Examples:
Load image data from a column for training:
# Access a single image
image = ds["images"][0]
# Load a batch of images
batch = ds["images"][0:32]
# Async load for better performance
images_future = ds["images"].get_async(slice(0, 32))
images = images_future.result()
Access embeddings for similarity search:
# Get all embeddings
embeddings = ds["embeddings"][:]
# Get specific embeddings by indices
selected = ds["embeddings"][[1, 5, 10]]
Check column properties:
# Get column name
name = ds["images"].name
# Access metadata
if "mean" in ds["images"].metadata.keys():
    mean = ds["images"].metadata["mean"]
__getitem__
¶
__getitem__(
    index: int | slice | list | tuple,
) -> ndarray | list | Dict | str | bytes | None | Array
Retrieve data from the column at the specified index or range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | `int \| slice \| list \| tuple` | Can be: an integer index, a slice, or a list or tuple of indices | required |
Returns:
| Type | Description |
|---|---|
| `ndarray \| list \| Dict \| str \| bytes \| None \| Array` | The data at the specified index/indices. Type depends on the column's data type. |
Examples:
get_async
¶
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | `int \| slice \| list \| tuple` | Can be: an integer index, a slice, or a list or tuple of indices | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Future | `Future[Any]` | A Future object that resolves to the requested data. |
Examples:
metadata
property
¶
metadata: ReadOnlyMetadata
Access the column's metadata. Useful for storing statistics, preprocessing parameters, or other information about the column data.
Returns:
| Name | Type | Description |
|---|---|---|
| ReadOnlyMetadata | `ReadOnlyMetadata` | A ReadOnlyMetadata object for reading metadata. |
Examples:
name
property
¶
Get the name of the column.
Returns:
| Name | Type | Description |
|---|---|---|
| str | `str` | The column name. |
ColumnDefinition Class¶
deeplake.ColumnDefinition
¶
rename
¶
Renames the column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| new_name | `str` | The new name for the column | required |
ColumnDefinitionView Class¶
deeplake.ColumnDefinitionView
¶
A read-only view of a deeplake.ColumnDefinition.
Class Comparison¶
Column¶
- Provides read-write access
- Can modify data and metadata
- Can create/drop indexes for search optimization
- Access to column schema and data type information
- Supports both sync and async operations
- Raw bytes access for binary data
- Available in Dataset
# Get mutable column
ds = deeplake.open("s3://bucket/dataset")
column = ds["images"]
# Read data
image = column[0]
batch = column[0:100]
# Write data
column[0] = new_image
column[0:100] = new_batch
# Async operations
future = column.set_async(0, new_image)
future.wait()
ColumnView¶
- Read-only access
- Cannot modify data
- Can read metadata and schema information
- Access to indexes and data type information
- Supports both sync and async operations
- Raw bytes access for binary data
- Available in ReadOnlyDataset and DatasetView
# Get read-only column
ro_ds = deeplake.open_read_only("s3://bucket/dataset")
ro_column = ro_ds["images"]
# Read data
image = ro_column[0]
batch = ro_column[0:100]
# Async read
future = ro_column.get_async(slice(0, 100))
batch = future.result()
ColumnDefinition¶
- Schema-level operations for columns
- Can rename and drop columns
- Access to column data type definitions
- Available through dataset schema
ColumnDefinitionView¶
- Read-only schema information
- Access to column data type definitions
- Cannot modify column schema
- Available through read-only dataset schemas
Examples¶
Data Access¶
# Direct indexing
single_item = column[0]
batch = column[0:100]
selected = column[[1, 5, 10]]
# Async data access
future = column.get_async(slice(0, 1000))
data = future.result()
Metadata and Schema Information¶
# Read metadata from any column type
name = column.name
metadata = column.metadata
data_type = column.dtype
# Update metadata (Column only)
column.metadata["mean"] = [0.485, 0.456, 0.406]
column.metadata["std"] = [0.229, 0.224, 0.225]
# Check column indexes
indexes = column.indexes
print(f"Available indexes: {indexes}")
Index Management¶
# Create text search index (Column only)
column.create_index(deeplake.types.TextIndex(deeplake.types.BM25))
# Create embedding similarity index (Column only)
column.create_index(deeplake.types.EmbeddingIndex())
# List existing indexes
print(f"Current indexes: {column.indexes}")
# Drop an index
column.drop_index(deeplake.types.TextIndex(deeplake.types.BM25))
Binary Data Access¶
# Access raw bytes data (useful for images, audio, etc.)
bytes_data = column.get_bytes(0)
batch_bytes = column.get_bytes(slice(0, 10))
# Async bytes access
future = column.get_bytes_async(slice(0, 100))
bytes_batch = future.result()