semanticlens.foundation_models package

Submodules

semanticlens.foundation_models.base module

Base classes for foundation models and processors.

This module defines abstract base classes for vision-language foundation models and their processors, providing a consistent interface for different model implementations.

class semanticlens.foundation_models.base.AbstractVLM[source]

Bases: ABC

Abstract base class for vision-language foundation models.

This class defines the interface that all vision-language foundation models must implement, providing methods for encoding both vision and text inputs.

abstract property device

Get the device on which the model is located.

Returns:

The device (CPU/GPU) on which the model parameters are located.

Return type:

torch.device

abstract encode_image(*args, **kwargs)[source]

Encode image input into feature representation.

Parameters:
  • *args – Variable length argument list for image inputs.

  • **kwargs – Arbitrary keyword arguments for encoding options.

Returns:

Encoded image features.

Return type:

torch.Tensor

abstract encode_text(*args, **kwargs)[source]

Encode text input into feature representation.

Parameters:
  • *args – Variable length argument list for text inputs.

  • **kwargs – Arbitrary keyword arguments for encoding options.

Returns:

Encoded text features.

Return type:

torch.Tensor

abstract preprocess(img)[source]

Preprocess image input for model consumption.

Parameters:

img (torch.Tensor) – Input image tensor to preprocess.

Returns:

Preprocessed image tensor ready for model input.

Return type:

torch.Tensor

abstract to(device)[source]

Move the model to the specified device.

Parameters:

device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).

Returns:

The model instance after moving to the specified device.

Return type:

AbstractVLM

abstract tokenize(txt)[source]

Tokenize text input for model consumption.

Parameters:

txt (str) – Input text string to tokenize.

Returns:

Tokenized text tensor ready for model input.

Return type:

torch.Tensor
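
A minimal sketch of a custom backend implementing this interface; the wrapped model, preprocessor, and tokenizer objects are hypothetical placeholders rather than parts of this package:

from semanticlens.foundation_models.base import AbstractVLM


class MyVLM(AbstractVLM):
    """Sketch of a custom vision-language backend (illustrative only)."""

    def __init__(self, model, preprocessor, tokenizer):
        self.model = model                # any torch.nn.Module exposing encode_image/encode_text
        self.preprocessor = preprocessor  # callable: PIL image -> tensor
        self.tokenizer = tokenizer        # callable: str -> token tensor

    @property
    def device(self):
        return next(self.model.parameters()).device

    def encode_image(self, img):
        return self.model.encode_image(img)

    def encode_text(self, text_input):
        return self.model.encode_text(text_input)

    def preprocess(self, img):
        # add a batch dimension and move to the model's device
        return self.preprocessor(img).unsqueeze(0).to(self.device)

    def tokenize(self, txt):
        return self.tokenizer(txt).to(self.device)

    def to(self, device):
        self.model = self.model.to(device)
        return self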

semanticlens.foundation_models.clip module

CLIP model implementations for vision-language tasks.

This module provides concrete implementations of various CLIP model variants, including OpenCLIP, SigLIP V2, and MobileCLIP models for encoding both images and text into a shared embedding space.

Classes

OpenClip

OpenCLIP implementation supporting various model architectures.

SigLipV2

SigLIP V2 model implementation.

ClipMobile

MobileCLIP model implementation optimized for mobile deployment.

class semanticlens.foundation_models.clip.ClipMobile(version='s1', device='cpu', **kwargs)[source]

Bases: OpenClip

MobileCLIP vision-language model implementation.

A specialized OpenCLIP implementation using MobileCLIP models, which are optimized for mobile deployment and efficient inference while maintaining performance.

Parameters:
  • version (str, optional) – The MobileCLIP version to use (‘s1’ or ‘s2’), by default “s1”.

  • device (str, optional) – The device to load the model on, by default “cpu”.

URLs

Dictionary mapping version names to model identifiers.

Type:

dict

URLs = {'s1': 'MobileCLIP-S1', 's2': 'MobileCLIP-S2'}
__init__(version='s1', device='cpu', **kwargs)[source]
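
A brief usage sketch (the prompt text is illustrative; model weights are resolved through OpenCLIP's loading machinery on first use):

from semanticlens.foundation_models import ClipMobile

fm = ClipMobile(version="s2", device="cpu")  # load the MobileCLIP-S2 variant
tokens = fm.tokenize("a photo of a dog")     # illustrative prompt
text_features = fm.encode_text(tokens)
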
class semanticlens.foundation_models.clip.OpenClip(url, device='cpu', **kwargs)[source]

Bases: AbstractVLM

OpenCLIP vision-language model implementation.

This class provides a concrete implementation of the AbstractVLM abstract base class using OpenCLIP models. It supports encoding images and text into a shared embedding space for various vision-language tasks.

Parameters:
  • url (str) – The model URL or identifier for OpenCLIP model loading.

  • device (str, optional) – The device to load the model on, by default “cpu”.

model

The loaded OpenCLIP model.

Type:

torch.nn.Module

preprocessor

Image preprocessing function.

Type:

callable

tokenizer

Text tokenization function.

Type:

callable

__init__(url, device='cpu', **kwargs)[source]
__repr__()[source]

Return a string representation of the OpenClip instance.

Returns:

String representation including the model version and device.

Return type:

str

property device

Get the device on which the model is located.

Returns:

The device (CPU/GPU) on which the model parameters are located.

Return type:

torch.device

encode_image(img)[source]

Encode an image tensor into features.

Parameters:

img (torch.Tensor) – Input image tensor.

Returns:

Encoded image features.

Return type:

torch.Tensor

encode_text(text_input)[source]

Encode a text tensor into features.

Parameters:

text_input (torch.Tensor) – Input text tensor.

Returns:

Encoded text features.

Return type:

torch.Tensor

preprocess(img)[source]

Apply foundation model image preprocessing.

Preprocesses images for model consumption, handling both single images and lists of images. Also handles tensor dimension expansion and device placement automatically.

Parameters:

img (Image.Image or list[Image.Image]) – Input image(s) to preprocess. Can be a single PIL Image or a list of PIL Images.

Returns:

Preprocessed image tensor(s) ready for model input, moved to the correct device with proper batch dimensions.

Return type:

torch.Tensor

to(device)[source]

Move the model to the specified device.

Parameters:

device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).

Returns:

The model instance after moving to the specified device.

Return type:

torch.nn.Module

tokenize(txt, context_length=None)[source]

Tokenize a text string and move to the correct device.

Converts input text into tokenized format suitable for the model, automatically moving the result to the model’s device.

Parameters:
  • txt (str) – Input text string to tokenize.

  • context_length (int, optional) – Maximum context length for tokenization. If None, uses the model’s default context length.

Returns:

Tokenized text tensor ready for model input, on the correct device.

Return type:

torch.Tensor
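
An end-to-end sketch of the typical OpenClip workflow; the hf-hub identifier and the image path are illustrative assumptions, and any model identifier accepted by OpenCLIP should work:

import torch
from PIL import Image

from semanticlens.foundation_models import OpenClip

fm = OpenClip(url="hf-hub:timm/ViT-B-16-SigLIP2", device="cpu")

image = Image.open("example.jpg")    # hypothetical input image
pixels = fm.preprocess(image)        # also accepts a list of PIL Images
tokens = fm.tokenize("a photo of a cat")

with torch.no_grad():
    image_features = fm.encode_image(pixels)
    text_features = fm.encode_text(tokens)

# Compare the embeddings in the shared vision-language space
similarity = torch.nn.functional.cosine_similarity(image_features, text_features)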

class semanticlens.foundation_models.clip.SigLipV2(device='cpu', **kwargs)[source]

Bases: OpenClip

SigLIP V2 vision-language model implementation.

A specialized OpenCLIP implementation based on the SigLIP V2 model architecture, designed for improved vision-language understanding.

Parameters:

device (str, optional) – The device to load the model on, by default “cpu”.

URL

The model identifier for SigLIP V2 model loading.

Type:

str

URL = 'hf-hub:timm/ViT-B-16-SigLIP2'
__init__(device='cpu', **kwargs)[source]
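
A brief usage sketch (the GPU move is optional and only attempted when CUDA is available):

import torch

from semanticlens.foundation_models import SigLipV2

fm = SigLipV2(device="cpu")
print(fm.URL)                  # 'hf-hub:timm/ViT-B-16-SigLIP2'
if torch.cuda.is_available():  # optionally move the underlying model to GPU
    fm.to("cuda:0")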

Module contents

Foundation model implementations for semantic analysis.

This module provides implementations of vision-language foundation models, currently supporting various CLIP model variants from different sources.
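
All model classes are re-exported at the package level and can be imported directly:

from semanticlens.foundation_models import ClipMobile, OpenClip, SigLipV2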

Classes

class semanticlens.foundation_models.ClipMobile(version='s1', device='cpu', **kwargs)[source]

Bases: OpenClip

MobileCLIP vision-language model implementation.

A specialized OpenCLIP implementation using MobileCLIP models, which are optimized for mobile deployment and efficient inference while maintaining performance.

Parameters:
  • version (str, optional) – The MobileCLIP version to use (‘s1’ or ‘s2’), by default “s1”.

  • device (str, optional) – The device to load the model on, by default “cpu”.

URLs

Dictionary mapping version names to model identifiers.

Type:

dict

URLs = {'s1': 'MobileCLIP-S1', 's2': 'MobileCLIP-S2'}
__init__(version='s1', device='cpu', **kwargs)[source]
class semanticlens.foundation_models.OpenClip(url, device='cpu', **kwargs)[source]

Bases: AbstractVLM

OpenCLIP vision-language model implementation.

This class provides a concrete implementation of the AbstractVLM abstract base class using OpenCLIP models. It supports encoding images and text into a shared embedding space for various vision-language tasks.

Parameters:
  • url (str) – The model URL or identifier for OpenCLIP model loading.

  • device (str, optional) – The device to load the model on, by default “cpu”.

model

The loaded OpenCLIP model.

Type:

torch.nn.Module

preprocessor

Image preprocessing function.

Type:

callable

tokenizer

Text tokenization function.

Type:

callable

__init__(url, device='cpu', **kwargs)[source]
__repr__()[source]

Return a string representation of the OpenClip instance.

Returns:

String representation including the model version and device.

Return type:

str

property device

Get the device on which the model is located.

Returns:

The device (CPU/GPU) on which the model parameters are located.

Return type:

torch.device

encode_image(img)[source]

Encode an image tensor into features.

Parameters:

img (torch.Tensor) – Input image tensor.

Returns:

Encoded image features.

Return type:

torch.Tensor

encode_text(text_input)[source]

Encode a text tensor into features.

Parameters:

text_input (torch.Tensor) – Input text tensor.

Returns:

Encoded text features.

Return type:

torch.Tensor

preprocess(img)[source]

Apply foundation model image preprocessing.

Preprocesses images for model consumption, handling both single images and lists of images. Also handles tensor dimension expansion and device placement automatically.

Parameters:

img (Image.Image or list[Image.Image]) – Input image(s) to preprocess. Can be a single PIL Image or a list of PIL Images.

Returns:

Preprocessed image tensor(s) ready for model input, moved to the correct device with proper batch dimensions.

Return type:

torch.Tensor

to(device)[source]

Move the model to the specified device.

Parameters:

device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).

Returns:

The model instance after moving to the specified device.

Return type:

torch.nn.Module

tokenize(txt, context_length=None)[source]

Tokenize a text string and move to the correct device.

Converts input text into tokenized format suitable for the model, automatically moving the result to the model’s device.

Parameters:
  • txt (str) – Input text string to tokenize.

  • context_length (int, optional) – Maximum context length for tokenization. If None, uses the model’s default context length.

Returns:

Tokenized text tensor ready for model input, on the correct device.

Return type:

torch.Tensor

class semanticlens.foundation_models.SigLipV2(device='cpu', **kwargs)[source]

Bases: OpenClip

SigLIP V2 vision-language model implementation.

A specialized OpenCLIP implementation based on the SigLIP V2 model architecture, designed for improved vision-language understanding.

Parameters:

device (str, optional) – The device to load the model on, by default “cpu”.

URL

The model identifier for SigLIP V2 model loading.

Type:

str

URL = 'hf-hub:timm/ViT-B-16-SigLIP2'
__init__(device='cpu', **kwargs)[source]