semanticlens.foundation_models package

Submodules

semanticlens.foundation_models.base module

Base classes for foundation models and processors.

This module defines abstract base classes for vision-language foundation models and their processors, providing a consistent interface for different model implementations.

class semanticlens.foundation_models.base.AbstractVLM[source]

Bases: ABC

Abstract base class for vision-language foundation models.

This class defines the interface that all vision-language foundation models must implement, providing methods for encoding both vision and text inputs.

abstract property device

Get the device on which the model is located.

Returns:

The device (CPU/GPU) on which the model parameters are located.

Return type:

torch.device

abstract encode_image(*args, **kwargs)[source]

Encode image input into feature representation.

Parameters:
  • *args – Variable length argument list for image inputs.

  • **kwargs – Arbitrary keyword arguments for encoding options.

Returns:

Encoded image features.

Return type:

torch.Tensor

abstract encode_text(*args, **kwargs)[source]

Encode text input into feature representation.

Parameters:
  • *args – Variable length argument list for text inputs.

  • **kwargs – Arbitrary keyword arguments for encoding options.

Returns:

Encoded text features.

Return type:

torch.Tensor

abstract preprocess(img)[source]

Preprocess image input for model consumption.

Parameters:

img (torch.Tensor) – Input image tensor to preprocess.

Returns:

Preprocessed image tensor ready for model input.

Return type:

torch.Tensor

abstract to(device)[source]

Move the model to the specified device.

Parameters:

device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).

Returns:

The model instance after moving to the specified device.

Return type:

AbstractVLM

abstract tokenize(txt)[source]

Tokenize text input for model consumption.

Parameters:

txt (str) – Input text string to tokenize.

Returns:

Tokenized text tensor ready for model input.

Return type:

torch.Tensor
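
A minimal sketch of a custom backend implementing this interface; the wrapped model, preprocessor, and tokenizer objects are hypothetical placeholders rather than parts of this package:

from semanticlens.foundation_models.base import AbstractVLM


class MyVLM(AbstractVLM):
    """Sketch of a custom vision-language backend (illustrative only)."""

    def __init__(self, model, preprocessor, tokenizer):
        self.model = model                # any torch.nn.Module exposing encode_image/encode_text
        self.preprocessor = preprocessor  # callable: PIL image -> tensor
        self.tokenizer = tokenizer        # callable: str -> token tensor

    @property
    def device(self):
        return next(self.model.parameters()).device

    def encode_image(self, img):
        return self.model.encode_image(img)

    def encode_text(self, text_input):
        return self.model.encode_text(text_input)

    def preprocess(self, img):
        # add a batch dimension and move to the model's device
        return self.preprocessor(img).unsqueeze(0).to(self.device)

    def tokenize(self, txt):
        return self.tokenizer(txt).to(self.device)

    def to(self, device):
        self.model = self.model.to(device)
        return self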

semanticlens.foundation_models.clip module

CLIP model implementations for vision-language tasks.

This module provides concrete implementations of various CLIP model variants, including OpenCLIP, SigLIP V2, and MobileCLIP models for encoding both images and text into a shared embedding space.

Classes

OpenClip

OpenCLIP implementation supporting various model architectures.

SigLipV2

SigLIP V2 model implementation.

ClipMobile

MobileCLIP model implementation optimized for mobile deployment.

class semanticlens.foundation_models.clip.ClipMobile(version='s1', device='cpu', **kwargs)[source]

Bases: OpenClip

MobileCLIP vision-language model implementation.

A specialized OpenCLIP implementation using MobileCLIP models, which are optimized for mobile deployment and efficient inference while maintaining performance.

Parameters:
  • version (str, optional) – The MobileCLIP version to use (‘s1’ or ‘s2’), by default “s1”.

  • device (str, optional) – The device to load the model on, by default “cpu”.

URLs

Dictionary mapping version names to model identifiers.

Type:

dict

URLs = {'s1': 'MobileCLIP-S1', 's2': 'MobileCLIP-S2'}
__init__(version='s1', device='cpu', **kwargs)[source]
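
A brief usage sketch (the prompt text is illustrative; model weights are resolved through OpenCLIP's loading machinery on first use):

from semanticlens.foundation_models import ClipMobile

fm = ClipMobile(version="s2", device="cpu")  # load the MobileCLIP-S2 variant
tokens = fm.tokenize("a photo of a dog")     # illustrative prompt
text_features = fm.encode_text(tokens)
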
class semanticlens.foundation_models.clip.OpenClip(url, device='cpu', **kwargs)[source]

Bases: AbstractVLM

OpenCLIP vision-language model implementation.

This class provides a concrete implementation of the AbstractVLM abstract base class using OpenCLIP models. It supports encoding images and text into a shared embedding space for various vision-language tasks.

Parameters:
  • url (str) – The model URL or identifier for OpenCLIP model loading.

  • device (str, optional) – The device to load the model on, by default “cpu”.

model

The loaded OpenCLIP model.

Type:

torch.nn.Module

preprocessor

Image preprocessing function.

Type:

callable

tokenizer

Text tokenization function.

Type:

callable

__init__(url, device='cpu', **kwargs)[source]
__repr__()[source]

Return a string representation of the OpenClip instance.

Returns:

String representation including the model version and device.

Return type:

str

property device

Get the device on which the model is located.

Returns:

The device (CPU/GPU) on which the model parameters are located.

Return type:

torch.device

encode_image(img)[source]

Encode an image tensor into features.

Parameters:

img (torch.Tensor) – Input image tensor.

Returns:

Encoded image features.

Return type:

torch.Tensor

encode_text(text_input)[source]

Encode a text tensor into features.

Parameters:

text_input (torch.Tensor) – Input text tensor.

Returns:

Encoded text features.

Return type:

torch.Tensor

preprocess(img)[source]

Apply foundation model image preprocessing.

Preprocesses images for model consumption, handling both single images and lists of images. Also handles tensor dimension expansion and device placement automatically.

Parameters:

img (Image.Image or list[Image.Image]) – Input image(s) to preprocess. Can be a single PIL Image or a list of PIL Images.

Returns:

Preprocessed image tensor(s) ready for model input, moved to the correct device with proper batch dimensions.

Return type:

torch.Tensor

to(device)[source]

Move the model to the specified device.

Parameters:

device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).

Returns:

The model instance after moving to the specified device.

Return type:

torch.nn.Module

tokenize(txt, context_length=None)[source]

Tokenize a text string and move to the correct device.

Converts input text into tokenized format suitable for the model, automatically moving the result to the model’s device.

Parameters:
  • txt (str) – Input text string to tokenize.

  • context_length (int, optional) – Maximum context length for tokenization. If None, uses the model’s default context length.

Returns:

Tokenized text tensor ready for model input, on the correct device.

Return type:

torch.Tensor
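
An end-to-end sketch of the typical OpenClip workflow; the hf-hub identifier and the image path are illustrative assumptions, and any model identifier accepted by OpenCLIP should work:

import torch
from PIL import Image

from semanticlens.foundation_models import OpenClip

fm = OpenClip(url="hf-hub:timm/ViT-B-16-SigLIP2", device="cpu")

image = Image.open("example.jpg")    # hypothetical input image
pixels = fm.preprocess(image)        # also accepts a list of PIL Images
tokens = fm.tokenize("a photo of a cat")

with torch.no_grad():
    image_features = fm.encode_image(pixels)
    text_features = fm.encode_text(tokens)

# Compare the embeddings in the shared vision-language space
similarity = torch.nn.functional.cosine_similarity(image_features, text_features)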

class semanticlens.foundation_models.clip.SigLipV2(device='cpu', **kwargs)[source]

Bases: OpenClip

SigLIP V2 vision-language model implementation.

A specialized OpenCLIP implementation based on the SigLIP V2 model architecture, designed for improved vision-language understanding.

Parameters:

device (str, optional) – The device to load the model on, by default “cpu”.

URL

The model identifier for SigLIP V2 model loading.

Type:

str

URL = 'hf-hub:timm/ViT-B-16-SigLIP2'
__init__(device='cpu', **kwargs)[source]
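
A brief usage sketch (the GPU move is optional and only attempted when CUDA is available):

import torch

from semanticlens.foundation_models import SigLipV2

fm = SigLipV2(device="cpu")
print(fm.URL)                  # 'hf-hub:timm/ViT-B-16-SigLIP2'
if torch.cuda.is_available():  # optionally move the underlying model to GPU
    fm.to("cuda:0")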

Module contents

Foundation model implementations for semantic analysis.

This module provides implementations of vision-language foundation models, currently supporting various CLIP model variants from different sources.
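
All model classes are re-exported at the package level and can be imported directly:

from semanticlens.foundation_models import ClipMobile, OpenClip, SigLipV2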

Classes

class semanticlens.foundation_models.ClipMobile(version='s1', device='cpu', **kwargs)[source]

Bases: OpenClip

MobileCLIP vision-language model implementation.

A specialized OpenCLIP implementation using MobileCLIP models, which are optimized for mobile deployment and efficient inference while maintaining performance.

Parameters:
  • version (str, optional) – The MobileCLIP version to use (‘s1’ or ‘s2’), by default “s1”.

  • device (str, optional) – The device to load the model on, by default “cpu”.

URLs

Dictionary mapping version names to model identifiers.

Type:

dict

URLs = {'s1': 'MobileCLIP-S1', 's2': 'MobileCLIP-S2'}
__init__(version='s1', device='cpu', **kwargs)[source]
class semanticlens.foundation_models.OpenClip(url, device='cpu', **kwargs)[source]

Bases: AbstractVLM

OpenCLIP vision-language model implementation.

This class provides a concrete implementation of the AbstractVLM abstract base class using OpenCLIP models. It supports encoding images and text into a shared embedding space for various vision-language tasks.

Parameters:
  • url (str) – The model URL or identifier for OpenCLIP model loading.

  • device (str, optional) – The device to load the model on, by default “cpu”.

model

The loaded OpenCLIP model.

Type:

torch.nn.Module

preprocessor

Image preprocessing function.

Type:

callable

tokenizer

Text tokenization function.

Type:

callable

__init__(url, device='cpu', **kwargs)[source]
__repr__()[source]

Return a string representation of the OpenClip instance.

Returns:

String representation including the model version and device.

Return type:

str

property device

Get the device on which the model is located.

Returns:

The device (CPU/GPU) on which the model parameters are located.

Return type:

torch.device

encode_image(img)[source]

Encode an image tensor into features.

Parameters:

img (torch.Tensor) – Input image tensor.

Returns:

Encoded image features.

Return type:

torch.Tensor

encode_text(text_input)[source]

Encode a text tensor into features.

Parameters:

text_input (torch.Tensor) – Input text tensor.

Returns:

Encoded text features.

Return type:

torch.Tensor

preprocess(img)[source]

Apply foundation model image preprocessing.

Preprocesses images for model consumption, handling both single images and lists of images. Also handles tensor dimension expansion and device placement automatically.

Parameters:

img (Image.Image or list[Image.Image]) – Input image(s) to preprocess. Can be a single PIL Image or a list of PIL Images.

Returns:

Preprocessed image tensor(s) ready for model input, moved to the correct device with proper batch dimensions.

Return type:

torch.Tensor

to(device)[source]

Move the model to the specified device.

Parameters:

device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).

Returns:

The model instance after moving to the specified device.

Return type:

torch.nn.Module

tokenize(txt, context_length=None)[source]

Tokenize a text string and move to the correct device.

Converts input text into tokenized format suitable for the model, automatically moving the result to the model’s device.

Parameters:
  • txt (str) – Input text string to tokenize.

  • context_length (int, optional) – Maximum context length for tokenization. If None, uses the model’s default context length.

Returns:

Tokenized text tensor ready for model input, on the correct device.

Return type:

torch.Tensor

class semanticlens.foundation_models.SigLipV2(device='cpu', **kwargs)[source]

Bases: OpenClip

SigLIP V2 vision-language model implementation.

A specialized OpenCLIP implementation based on the SigLIP V2 model architecture, designed for improved vision-language understanding.

Parameters:

device (str, optional) – The device to load the model on, by default “cpu”.

URL

The model identifier for SigLIP V2 model loading.

Type:

str

URL = 'hf-hub:timm/ViT-B-16-SigLIP2'
__init__(device='cpu', **kwargs)[source]