Column Classes¶

Deep Lake provides four column classes, covering data access and schema definition at both read-write and read-only levels:

Class	Description
Column	Full read-write access to column data
ColumnView	Read-only access to column data
ColumnDefinition	Schema definition for columns with modification capabilities
ColumnDefinitionView	Read-only schema definition for columns

Column Class¶

deeplake.Column ¶

Bases: ColumnView

Provides read-write access to a column in a dataset. Column extends ColumnView with methods for modifying data, making it suitable for dataset creation and updates in ML workflows.

The Column class allows you to:

Read and write data using integer indices, slices, or lists of indices
Modify data asynchronously for better performance
Access and modify column metadata
Handle various data types common in ML: images, embeddings, labels, etc.

Examples:

Update training labels:

# Update single label
ds["labels"][0] = 1

# Update batch of labels
ds["labels"][0:32] = new_labels

# Async update for better performance
future = ds["labels"].set_async(slice(0, 32), new_labels)
future.wait()

Store image embeddings:

# Generate and store embeddings
embeddings = model.encode(images)
ds["embeddings"][0:len(embeddings)] = embeddings

Manage column metadata:

# Store preprocessing parameters
ds["images"].metadata["mean"] = [0.485, 0.456, 0.406]
ds["images"].metadata["std"] = [0.229, 0.224, 0.225]
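The "Bases: ColumnView" relationship above can be pictured with a small stand-alone sketch. The class names `ColumnViewSketch` and `ColumnSketch` are hypothetical stand-ins backed by a plain list, not the deeplake classes; they only illustrate how a read-write type can inherit all read operations and add writes on top. The exception a real ColumnView raises on assignment is not stated on this page, so the `TypeError` here is simply what Python does for a class without `__setitem__`.

```python
from typing import Any, List


class ColumnViewSketch:
    """Read-only access: indexing works, assignment does not."""

    def __init__(self, name: str, data: List[Any]) -> None:
        self._name = name
        self._data = data

    @property
    def name(self) -> str:
        return self._name

    def __getitem__(self, index):
        # Lists and tuples select several specific rows.
        if isinstance(index, (list, tuple)):
            return [self._data[i] for i in index]
        # Ints and slices are passed straight to the underlying list.
        return self._data[index]


class ColumnSketch(ColumnViewSketch):
    """Read-write access: inherits every read and adds writes."""

    def __setitem__(self, index, value) -> None:
        self._data[index] = value


rows = [10, 20, 30]
view = ColumnViewSketch("labels", rows)
column = ColumnSketch("labels", rows)

column[0] = 99
print(view[0])  # the view reads through the same storage

try:
    view[0] = 1
except TypeError as exc:
    print("read-only:", exc)
```

Keeping writes in a subclass means any code that only needs to read can accept the view type and work with both.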

__getitem__ ¶

__getitem__(
    index: int | slice | list | tuple,
) -> ndarray | list | Dict | str | bytes | None | Array

Retrieve data from the column at the specified index or range.

Parameters:

Name	Type	Description	Default
`index`	`int \| slice \| list \| tuple`	Can be: int: Single item index slice: Range of indices (e.g., 0:10) list/tuple: Multiple specific indices	required

Returns:

Type	Description
`ndarray \| list \| Dict \| str \| bytes \| None \| Array`	The data at the specified index/indices. Type depends on the column's data type.

Examples:

# Get single item
image = column[0]

# Get range
batch = column[0:32]

# Get specific indices
items = column[[1, 5, 10]]
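To make the three accepted index forms concrete, the sketch below expands each one into explicit row positions. `resolve_index` is a hypothetical helper written for illustration and is not part of the deeplake API. It follows Python's own slicing rules (slices are clamped to the length); how Deep Lake treats negative or out-of-range indices is not stated here, so that behaviour is an assumption.

```python
from typing import List, Sequence, Union

Index = Union[int, slice, Sequence[int]]


def resolve_index(index: Index, length: int) -> List[int]:
    """Expand an int, slice, or list/tuple of ints into explicit row positions.

    Hypothetical helper for illustration; not part of the deeplake API.
    """
    if isinstance(index, int):
        return [index]
    if isinstance(index, slice):
        # slice.indices() clamps start/stop to the given length.
        return list(range(*index.indices(length)))
    return [int(i) for i in index]


print(resolve_index(0, 100))            # the row read by column[0]
print(resolve_index(slice(0, 4), 100))  # the rows read by column[0:4]
print(resolve_index([1, 5, 10], 100))   # the rows read by column[[1, 5, 10]]
```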

__setitem__ ¶

__setitem__(index: int | slice, value: Any) -> None

Set data in the column at the specified index or range.

Parameters:

Name	Type	Description	Default
`index`	`int \| slice`	Can be: int: Single item index slice: Range of indices (e.g., 0:10)	required
`value`	`Any`	The data to store. Must match the column's data type.	required

Examples:

# Update single item
column[0] = new_image

# Update range
column[0:32] = new_batch
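When the new values do not fit comfortably in one assignment, the range form can be applied in fixed-size chunks. The following is a minimal sketch: `batch_slices` is a hypothetical helper, and a plain Python list stands in for the column so the snippet runs without a dataset. With Deep Lake the target would be something like `ds["labels"]`.

```python
from typing import Iterator


def batch_slices(total: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive slices that cover range(total) in batch_size steps."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield slice(start, min(start + batch_size, total))


# Stand-in for a writable column and for the values to store.
column = [0] * 10
new_values = list(range(10))

for s in batch_slices(len(new_values), batch_size=4):
    # Same assignment form as column[0:32] = new_batch.
    column[s] = new_values[s]

print(column)
```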

create_index ¶

create_index(
    index_type: (
        str
        | IndexType
        | TextIndex
        | JsonIndex
        | NumericIndex
        | EmbeddingIndexType
        | EmbeddingsMatrixIndexType
        | Index
    ),
) -> None

Create an index on the column.

Parameters:

Name	Type	Description	Default
`index_type`	`str \| IndexType \| TextIndex \| JsonIndex \| NumericIndex \| EmbeddingIndexType \| EmbeddingsMatrixIndexType \| Index`	Index type to create. It can be specified as an IndexType enum value, a string, or a wrapped type, as listed below.	required

Using the IndexType enum (recommended; the column type is detected automatically):

`deeplake.types.Inverted`: Keyword lookup index for text/JSON/numeric data. Use with `CONTAINS()`
`deeplake.types.BM25`: BM25-based full-text search for text. Use with `BM25_SIMILARITY()`
`deeplake.types.Exact`: Exact match index for text data
`deeplake.types.Clustered`: Clustered embedding index (default for embeddings)
`deeplake.types.ClusteredQuantized`: Quantized clustered embedding index (faster, slight accuracy loss)
`deeplake.types.PooledQuantized`: Pooled quantized index for 2D embedding matrices. Use with `MAXSIM()`

Using a string (the column type is detected automatically):

`"inverted_index"`: Same as `deeplake.types.Inverted`
`"bm25"`: Same as `deeplake.types.BM25`
`"exact"`: Same as `deeplake.types.Exact`
`"clustered"`: Same as `deeplake.types.Clustered`
`"clustered_quantized"`: Same as `deeplake.types.ClusteredQuantized`
`"pooled_quantized"`: Same as `deeplake.types.PooledQuantized`

Using wrapped types (explicit type specification):

`deeplake.types.TextIndex()` with `index_type`: For text columns (Inverted, BM25, Exact)
`deeplake.types.NumericIndex()` with `index_type`: For numeric columns (Inverted)
`deeplake.types.JsonIndex()` with `index_type`: For Dict columns (Inverted)
`deeplake.types.EmbeddingIndexType()` with `index_type`: For embedding columns (Clustered, ClusteredQuantized)
`deeplake.types.EmbeddingsMatrixIndexType()`: For 2D embedding matrix columns (PooledQuantized)
`deeplake.types.Index()` with `wrapped_type`: Generic wrapper for any index type

Examples:

# Text column indexing with enum (recommended)
import deeplake

ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["machine learning fundamentals"]})
ds["text_col"].create_index(deeplake.types.Inverted)  # For CONTAINS queries

# Text column indexing with string
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["machine learning fundamentals"]})
ds["text_col"].create_index("bm25")  # For BM25_SIMILARITY queries

# Numeric column indexing
ds = deeplake.create("tmp://")
ds.add_column("price", deeplake.types.Float32())
ds.append({"price": [29.99]})
ds["price"].create_index(deeplake.types.Inverted)  # For numeric CONTAINS queries

# Dict (JSON) column indexing with wrapped type
ds = deeplake.create("tmp://")
ds.add_column("metadata", deeplake.types.Dict())
ds.append({"metadata": [{"category": "ml"}]})
ds["metadata"].create_index(deeplake.types.JsonIndex(deeplake.types.Inverted))

# 2D Array for embedding matrices (ColBERT-style retrieval)
import numpy as np

ds = deeplake.create("tmp://")
ds.add_column("token_embeddings", deeplake.types.Array("float32", dimensions=2))
ds.append({"token_embeddings": [np.random.rand(10, 128)]})
ds["token_embeddings"].create_index(deeplake.types.PooledQuantized)  # For MAXSIM queries
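The string aliases and their enum equivalents from the parameter description can be kept side by side in a small lookup table. This is only a restatement of that description as data: `INDEX_ALIASES` and `describe_index` are hypothetical names, and the query function is recorded only where the description names one.

```python
from typing import Dict, Optional, Tuple

# String alias -> (enum name under deeplake.types, query function named in
# the parameter description, or None where the description names none).
INDEX_ALIASES: Dict[str, Tuple[str, Optional[str]]] = {
    "inverted_index": ("Inverted", "CONTAINS"),
    "bm25": ("BM25", "BM25_SIMILARITY"),
    "exact": ("Exact", None),
    "clustered": ("Clustered", None),
    "clustered_quantized": ("ClusteredQuantized", None),
    "pooled_quantized": ("PooledQuantized", "MAXSIM"),
}


def describe_index(alias: str) -> str:
    """Return a one-line summary for a string alias (hypothetical helper)."""
    enum_name, query_fn = INDEX_ALIASES[alias]
    summary = f'"{alias}" is the same as deeplake.types.{enum_name}'
    if query_fn is not None:
        summary += f"; query it with {query_fn}()"
    return summary


for alias in INDEX_ALIASES:
    print(describe_index(alias))
```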

drop_index ¶

drop_index(
    index_type: (
        str
        | IndexType
        | TextIndex
        | JsonIndex
        | NumericIndex
        | EmbeddingIndexType
        | EmbeddingsMatrixIndexType
        | Index
    ),
) -> None

Drop an index from the column.

Parameters:

Name	Type	Description	Default
`index_type`	`str \| IndexType \| TextIndex \| JsonIndex \| NumericIndex \| EmbeddingIndexType \| EmbeddingsMatrixIndexType \| Index`	Index type to drop. It can be specified as an IndexType enum value, a string, or a wrapped type, as listed below.	required

Using the IndexType enum (recommended; the column type is detected automatically):

`deeplake.types.Inverted`: Drop inverted index for text/JSON/numeric data
`deeplake.types.BM25`: Drop BM25 index for text
`deeplake.types.Exact`: Drop exact match index for text
`deeplake.types.Clustered`: Drop clustered embedding index
`deeplake.types.ClusteredQuantized`: Drop quantized clustered embedding index
`deeplake.types.PooledQuantized`: Drop pooled quantized index for 2D embedding matrices

Using a string (the column type is detected automatically):

`"inverted_index"`: Same as `deeplake.types.Inverted`
`"bm25"`: Same as `deeplake.types.BM25`
`"exact"`: Same as `deeplake.types.Exact`
`"clustered"`: Same as `deeplake.types.Clustered`
`"clustered_quantized"`: Same as `deeplake.types.ClusteredQuantized`
`"pooled_quantized"`: Same as `deeplake.types.PooledQuantized`

Using wrapped types (explicit type specification):

`deeplake.types.TextIndex()` with `index_type`: For text columns (Inverted, BM25, Exact)
`deeplake.types.NumericIndex()` with `index_type`: For numeric columns (Inverted)
`deeplake.types.JsonIndex()` with `index_type`: For Dict columns (Inverted)
`deeplake.types.EmbeddingIndexType()` with `index_type`: For embedding columns (Clustered, ClusteredQuantized)
`deeplake.types.EmbeddingsMatrixIndexType()`: For 2D embedding matrix columns (PooledQuantized)
`deeplake.types.Index()` with `wrapped_type`: Generic wrapper for any index type

Examples:

# Drop text index using enum (recommended)
import deeplake

ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["sample text"]})
ds["text_col"].create_index(deeplake.types.Inverted)
ds["text_col"].drop_index(deeplake.types.Inverted)

# Drop text index using string
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["sample text"]})
ds["text_col"].create_index("bm25")
ds["text_col"].drop_index("bm25")

# Drop text index using wrapped type
ds = deeplake.create("tmp://")
ds.add_column("text_col", deeplake.types.Text())
ds.append({"text_col": ["sample text"]})
ds["text_col"].create_index(deeplake.types.TextIndex(deeplake.types.Exact))
ds["text_col"].drop_index(deeplake.types.TextIndex(deeplake.types.Exact))

# Drop numeric index using wrapped type
ds = deeplake.create("tmp://")
ds.add_column("price", deeplake.types.Float32())
ds.append({"price": [19.99]})
ds["price"].create_index(deeplake.types.NumericIndex(deeplake.types.Inverted))
ds["price"].drop_index(deeplake.types.NumericIndex(deeplake.types.Inverted))

# Drop 2D embedding matrix index
import numpy as np

ds = deeplake.create("tmp://")
ds.add_column("token_embeddings", deeplake.types.Array("float32", dimensions=2))
ds.append({"token_embeddings": [np.random.rand(10, 128)]})
ds["token_embeddings"].create_index(deeplake.types.PooledQuantized)
ds["token_embeddings"].drop_index(deeplake.types.PooledQuantized)

get_async ¶

get_async(index: int | slice | list | tuple) -> Future[Any]

Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.

Parameters:

Name	Type	Description	Default
`index`	`int \| slice \| list \| tuple`	Can be: int: Single item index slice: Range of indices list/tuple: Multiple specific indices	required

Returns:

Name	Type	Description
`Future`	`Future[Any]`	A Future object that resolves to the requested data.

Examples:

# Async batch load
future = column.get_async(slice(0, 32))
batch = future.result()

# Using with async/await
async def load_batch():
    batch = await column.get_async(slice(0, 32))
    return batch
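One common use of a Future-returning read is to request the next batch while the current one is still being processed. The sketch below shows that prefetch pattern with the standard library only. The `get_async` function here is a stand-in built on `concurrent.futures`, and `iterate_batches` is a hypothetical helper; Deep Lake's own Future type is separate, and the only thing assumed of it is the `.result()` call shown in the example above.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Iterator, List

DATA = list(range(100))  # stand-in for the column contents
_pool = ThreadPoolExecutor(max_workers=2)


def get_async(index: slice) -> "Future[List[int]]":
    """Stand-in for column.get_async: returns a Future immediately."""
    return _pool.submit(lambda: DATA[index])


def iterate_batches(total: int, batch_size: int) -> Iterator[List[int]]:
    """Yield batches while the following batch is already being fetched."""
    pending = get_async(slice(0, min(batch_size, total)))
    for start in range(batch_size, total + batch_size, batch_size):
        batch = pending.result()  # block only when the data is needed
        if start < total:
            # Request the next range before handing the current one out.
            pending = get_async(slice(start, min(start + batch_size, total)))
        yield batch


batches = list(iterate_batches(len(DATA), batch_size=32))
print([len(b) for b in batches])
```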

get_bytes ¶

get_bytes(
    index: int | slice | list | tuple,
) -> bytes | list

get_bytes_async ¶

get_bytes_async(
    index: int | slice | list | tuple,
) -> Future[bytes | list]

indexes `property` ¶

indexes: list[Index]

Get a list of indexes on the column.

Examples:

ds.add_column("A", deeplake.types.Text(deeplake.types.BM25))
print([str(element) for element in ds["A"].indexes])

metadata `property` ¶

metadata: Metadata

name `property` ¶

name: str

Get the name of the column.

Returns:

Name	Type	Description
`str`	`str`	The column name.

set_async ¶

set_async(index: int | slice, value: Any) -> FutureVoid

Asynchronously set data in the column. Useful for large updates or when modifying multiple items in ML pipelines.

Parameters:

Name	Type	Description	Default
`index`	`int \| slice`	Can be: int: Single item index slice: Range of indices	required
`value`	`Any`	The data to store. Must match the column's data type.	required

Returns:

Name	Type	Description
`FutureVoid`	`FutureVoid`	A FutureVoid that completes when the update is finished.

Examples:

# Async batch update
future = column.set_async(slice(0, 32), new_batch)
future.wait()

# Using with async/await
async def update_batch():
    await column.set_async(slice(0, 32), new_batch)
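If the returned futures are awaitable, as the async/await example above suggests, several updates could be started together and awaited with `asyncio.gather`. The following is a sketch under that assumption: the `set_async` coroutine is a stand-in that writes into a plain list, and whether Deep Lake permits concurrent writes to non-overlapping ranges of one column is not stated here.

```python
import asyncio
from typing import Any, List

column: List[Any] = [0] * 8  # stand-in for a writable column


async def set_async(index: slice, value: List[Any]) -> None:
    """Stand-in for awaiting column.set_async(index, value)."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    column[index] = value


async def update_all() -> None:
    # Start both updates, then wait until both have finished.
    await asyncio.gather(
        set_async(slice(0, 4), [1, 1, 1, 1]),
        set_async(slice(4, 8), [2, 2, 2, 2]),
    )


asyncio.run(update_all())
print(column)
```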

ColumnView Class¶

deeplake.ColumnView ¶

Provides read-only access to a column in a dataset. ColumnView is designed for efficient data access in ML workflows, supporting both synchronous and asynchronous operations.

The ColumnView class allows you to:

Access column data using integer indices, slices, or lists of indices
Retrieve data asynchronously for better performance in ML pipelines
Access column metadata and properties
Get information about linked data if the column contains references

Examples:

Load image data from a column for training:

# Access a single image
image = ds["images"][0]

# Load a batch of images
batch = ds["images"][0:32]

# Async load for better performance
images_future = ds["images"].get_async(slice(0, 32))
images = images_future.result()

Access embeddings for similarity search:

# Get all embeddings
embeddings = ds["embeddings"][:]

# Get specific embeddings by indices
selected = ds["embeddings"][[1, 5, 10]]

Check column properties:

# Get column name
name = ds["images"].name

# Access metadata
if "mean" in ds["images"].metadata.keys():
    mean = ds["images"].metadata["mean"]

__getitem__ ¶

__getitem__(
    index: int | slice | list | tuple,
) -> ndarray | list | Dict | str | bytes | None | Array

Retrieve data from the column at the specified index or range.

Parameters:

Name	Type	Description	Default
`index`	`int \| slice \| list \| tuple`	Can be: int: Single item index slice: Range of indices (e.g., 0:10) list/tuple: Multiple specific indices	required

Returns:

Type	Description
`ndarray \| list \| Dict \| str \| bytes \| None \| Array`	The data at the specified index/indices. Type depends on the column's data type.

Examples:

# Get single item
image = column[0]

# Get range
batch = column[0:32]

# Get specific indices
items = column[[1, 5, 10]]

get_async ¶

get_async(index: int | slice | list | tuple) -> Future[Any]

Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.

Parameters:

Name	Type	Description	Default
`index`	`int \| slice \| list \| tuple`	Can be: int: Single item index slice: Range of indices list/tuple: Multiple specific indices	required

Returns:

Name	Type	Description
`Future`	`Future[Any]`	A Future object that resolves to the requested data.

Examples:

# Async batch load
future = column.get_async(slice(0, 32))
batch = future.result()

# Using with async/await
async def load_batch():
    batch = await column.get_async(slice(0, 32))
    return batch

get_bytes ¶

get_bytes(
    index: int | slice | list | tuple,
) -> bytes | list

get_bytes_async ¶

get_bytes_async(
    index: int | slice | list | tuple,
) -> Future[bytes | list]

indexes `property` ¶

indexes: list[Index]

Get a list of indexes on the column.

Examples:

ds.add_column("A", deeplake.types.Text(deeplake.types.BM25))
print([str(element) for element in ds["A"].indexes])

metadata `property` ¶

metadata: ReadOnlyMetadata

Access the column's metadata. Useful for storing statistics, preprocessing parameters, or other information about the column data.

Returns:

Name	Type	Description
`ReadOnlyMetadata`	`ReadOnlyMetadata`	A ReadOnlyMetadata object for reading metadata.

Examples:

# Access preprocessing parameters
mean = column.metadata["mean"]
std = column.metadata["std"]

# Check available metadata
for key in column.metadata.keys():
    print(f"{key}: {column.metadata[key]}")
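As an illustration of how stored preprocessing parameters might be used, the sketch below standardizes one RGB pixel with per-channel `mean` and `std` values. A plain dict stands in for `column.metadata`, and `normalize_pixel` is a hypothetical helper. Scaling by 255 before standardizing is an assumption about how the statistics were computed; it is not something this page specifies.

```python
from typing import Dict, List, Sequence

# Stand-in for column.metadata, using the values from the Column example.
metadata: Dict[str, List[float]] = {
    "mean": [0.485, 0.456, 0.406],
    "std": [0.229, 0.224, 0.225],
}


def normalize_pixel(rgb: Sequence[int], meta: Dict[str, List[float]]) -> List[float]:
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, meta["mean"], meta["std"])
    ]


print(normalize_pixel((124, 116, 104), metadata))
```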

name `property` ¶

name: str

Get the name of the column.

Returns:

Name	Type	Description
`str`	`str`	The column name.

ColumnDefinition Class¶

deeplake.ColumnDefinition ¶

dtype `property` ¶

dtype: Type

The column's data type.

drop ¶

drop() -> None

Drops the column from the dataset.

name `property` ¶

name: str

The name of the column.

rename ¶

rename(new_name: str) -> None

Renames the column.

Parameters:

Name	Type	Description	Default
`new_name`	`str`	The new name for the column	required

ColumnDefinitionView Class¶

deeplake.ColumnDefinitionView ¶

A read-only view of a deeplake.ColumnDefinition.

dtype `property` ¶

dtype: Type

The column's data type.

name `property` ¶

name: str

The name of the column.

Class Comparison¶

Column¶

Provides read-write access
Can modify data and metadata
Can create/drop indexes for search optimization
Access to column schema and data type information
Supports both sync and async operations
Raw bytes access for binary data
Available in Dataset

# Get mutable column
ds = deeplake.open("s3://bucket/dataset")
column = ds["images"]

# Read data
image = column[0]
batch = column[0:100]

# Write data
column[0] = new_image
column[0:100] = new_batch

# Async operations
future = column.set_async(0, new_image)
future.wait()

ColumnView¶

Read-only access
Cannot modify data
Can read metadata and schema information
Access to indexes and data type information
Supports both sync and async operations
Raw bytes access for binary data
Available in ReadOnlyDataset and DatasetView

# Get read-only column
ro_ds = deeplake.open_read_only("s3://bucket/dataset")
ro_column = ro_ds["images"]

# Read data
image = ro_column[0]
batch = ro_column[0:100]

# Async read
future = ro_column.get_async(slice(0, 100))
batch = future.result()

ColumnDefinition¶

Schema-level operations for columns
Can rename and drop columns
Access to column data type definitions
Available through dataset schema

ColumnDefinitionView¶

Read-only schema information
Access to column data type definitions
Cannot modify column schema
Available through read-only dataset schemas

Examples¶

Data Access¶

# Direct indexing
single_item = column[0]
batch = column[0:100]
selected = column[[1, 5, 10]]

# Async data access 
future = column.get_async(slice(0, 1000))
data = future.result()

Metadata and Schema Information¶

# Read metadata from any column type
name = column.name
metadata = column.metadata
data_type = column.dtype

# Update metadata (Column only)
column.metadata["mean"] = [0.485, 0.456, 0.406]
column.metadata["std"] = [0.229, 0.224, 0.225]

# Check column indexes
indexes = column.indexes
print(f"Available indexes: {indexes}")

Index Management¶

# Create text search index (Column only)
column.create_index(deeplake.types.TextIndex(deeplake.types.BM25))

# Create embedding similarity index (Column only)
column.create_index(deeplake.types.EmbeddingIndex())

# List existing indexes
print(f"Current indexes: {column.indexes}")

# Drop an index
column.drop_index(deeplake.types.TextIndex(deeplake.types.BM25))

Binary Data Access¶

# Access raw bytes data (useful for images, audio, etc.)
bytes_data = column.get_bytes(0)
batch_bytes = column.get_bytes(slice(0, 10))

# Async bytes access
future = column.get_bytes_async(slice(0, 100))
bytes_batch = future.result()
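Raw bytes are most useful when the caller knows what encoding they hold. Assuming the bytes returned for an image column are the encoded file contents (an assumption, since this page does not describe the byte layout), the container format could be identified from its leading signature. `sniff_image_format` is a hypothetical helper that checks only the standard PNG and JPEG signatures.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"      # start-of-image marker plus next marker byte


def sniff_image_format(data: bytes) -> str:
    """Guess the image container from its leading bytes (hypothetical helper)."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return "unknown"


# With Deep Lake, `data` would come from something like column.get_bytes(0).
print(sniff_image_format(PNG_SIGNATURE + b"rest of file"))
print(sniff_image_format(b"\xff\xd8\xff\xe0" + b"rest of file"))
print(sniff_image_format(b"plain text"))
```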

Schema Operations¶

# Access column schema information
schema = ds.schema
col_def = schema["images"]
print(f"Column type: {col_def.dtype}")
print(f"Column name: {col_def.name}")

# Modify column schema (Dataset only)
col_def.rename("processed_images")
# col_def.drop()  # Removes entire column
