semanticlens.foundation_models package¶
Submodules¶
semanticlens.foundation_models.base module¶
Base classes for foundation models and processors.
This module defines abstract base classes for vision-language foundation models and their processors, providing a consistent interface for different model implementations.
- class semanticlens.foundation_models.base.AbstractVLM[source]¶
Bases: ABC
Abstract base class for vision-language foundation models.
This class defines the interface that all vision-language foundation models must implement, providing methods for encoding both vision and text inputs. A minimal subclass sketch follows this class entry.
- abstract property device¶
Get the device on which the model is located.
- Returns:
The device (CPU/GPU) on which the model parameters are located.
- Return type:
torch.device
- abstract encode_image(*args, **kwargs)[source]¶
Encode image input into feature representation.
- Parameters:
*args – Variable length argument list for image inputs.
**kwargs – Arbitrary keyword arguments for encoding options.
- Returns:
Encoded image features.
- Return type:
torch.Tensor
- abstract encode_text(*args, **kwargs)[source]¶
Encode text input into feature representation.
- Parameters:
*args – Variable length argument list for text inputs.
**kwargs – Arbitrary keyword arguments for encoding options.
- Returns:
Encoded text features.
- Return type:
torch.Tensor
- abstract preprocess(img)[source]¶
Preprocess image input for model consumption.
- Parameters:
img (torch.Tensor) – Input image tensor to preprocess.
- Returns:
Preprocessed image tensor ready for model input.
- Return type:
torch.Tensor
- abstract to(device)[source]¶
Move the model to the specified device.
- Parameters:
device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).
- Returns:
The model instance after moving to the specified device.
- Return type:
AbstractVLM
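The following is a minimal sketch of how a subclass of AbstractVLM might satisfy this interface. The wrapped backbone and its encode_image/encode_text methods, as well as the preprocess and tokenizer callables, are hypothetical placeholders; only the AbstractVLM interface itself (device, encode_image, encode_text, preprocess, to) comes from this module.

```python
import torch

from semanticlens.foundation_models.base import AbstractVLM


class MyVLM(AbstractVLM):
    """Hypothetical wrapper around a user-supplied vision-language backbone."""

    def __init__(self, backbone: torch.nn.Module, preprocess_fn, tokenizer):
        self._backbone = backbone            # assumed to expose encode_image/encode_text
        self._preprocess_fn = preprocess_fn  # callable: PIL image -> tensor
        self._tokenizer = tokenizer          # callable: str -> token tensor

    @property
    def device(self):
        # Report the device of the wrapped model's parameters.
        return next(self._backbone.parameters()).device

    def encode_image(self, img):
        return self._backbone.encode_image(img.to(self.device))

    def encode_text(self, text_input):
        return self._backbone.encode_text(text_input.to(self.device))

    def preprocess(self, img):
        return self._preprocess_fn(img)

    def to(self, device):
        self._backbone.to(device)
        return self  # return the model instance, as the interface specifies
```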
semanticlens.foundation_models.clip module¶
CLIP model implementations for vision-language tasks.
This module provides concrete implementations of various CLIP model variants, including OpenCLIP, SigLIP V2, and MobileCLIP models for encoding both images and text into a shared embedding space.
Classes¶
- OpenClip
OpenCLIP implementation supporting various model architectures.
- SigLipV2
SigLIP V2 model implementation.
- ClipMobile
MobileCLIP model implementation optimized for mobile deployment.
- class semanticlens.foundation_models.clip.ClipMobile(version='s1', device='cpu', **kwargs)[source]¶
Bases: OpenClip
MobileCLIP vision-language model implementation.
A specialized OpenCLIP implementation using MobileCLIP models optimized for mobile deployment with efficient inference while maintaining performance. A usage sketch follows the URLs attribute below.
- Parameters:
version (str, optional) – MobileCLIP variant to load, one of the keys of URLs ('s1' or 's2'), by default 's1'.
device (str, optional) – The device to load the model on, by default 'cpu'.
**kwargs – Additional keyword arguments for model creation.
- URLs = {'s1': 'MobileCLIP-S1', 's2': 'MobileCLIP-S2'}¶
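A usage sketch for ClipMobile, assuming a local example image; the 's1' key comes from the URLs mapping above, and the image path and prompt are placeholders.

```python
import torch
from PIL import Image

from semanticlens.foundation_models.clip import ClipMobile

fm = ClipMobile(version="s1", device="cpu")   # "s1" resolves to "MobileCLIP-S1" via URLs
image = Image.open("example.jpg")             # placeholder path

with torch.no_grad():
    image_features = fm.encode_image(fm.preprocess(image))
    text_features = fm.encode_text(fm.tokenize("a photo of a dog"))
```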
- class semanticlens.foundation_models.clip.OpenClip(url, device='cpu', **kwargs)[source]¶
Bases: AbstractVLM
OpenCLIP vision-language model implementation.
This class provides a concrete implementation of the AbstractVLM abstract base class using OpenCLIP models. It supports encoding images and text into a shared embedding space for various vision-language tasks. A usage sketch follows the method listing below.
- Parameters:
url (str) – Identifier of the OpenCLIP model to load (e.g., an 'hf-hub:' model reference).
device (str, optional) – The device to load the model on, by default 'cpu'.
**kwargs – Additional keyword arguments for model creation.
- model¶
The loaded OpenCLIP model.
- Type:
torch.nn.Module
- preprocessor¶
Image preprocessing function.
- Type:
callable
- tokenizer¶
Text tokenization function.
- Type:
callable
- __repr__()[source]¶
Return a string representation of the model instance.
- Returns:
String representation including the model version and device.
- Return type:
str
- property device¶
Get the device on which the model is located.
- Returns:
The device (CPU/GPU) on which the model parameters are located.
- Return type:
torch.device
- encode_image(img)[source]¶
Encode an image tensor into features.
- Parameters:
img (torch.Tensor) – Input image tensor.
- Returns:
Encoded image features.
- Return type:
torch.Tensor
- encode_text(text_input)[source]¶
Encode a text tensor into features.
- Parameters:
text_input (torch.Tensor) – Input text tensor.
- Returns:
Encoded text features.
- Return type:
torch.Tensor
- preprocess(img)[source]¶
Apply foundation model image preprocessing.
Preprocesses images for model consumption, handling both single images and lists of images. Also handles tensor dimension expansion and device placement automatically.
- Parameters:
img (Image.Image or list[Image.Image]) – Input image(s) to preprocess. Can be a single PIL Image or a list of PIL Images.
- Returns:
Preprocessed image tensor(s) ready for model input, moved to the correct device with proper batch dimensions.
- Return type:
torch.Tensor
- to(device)[source]¶
Move the model to the specified device.
- Parameters:
device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).
- Returns:
The model instance after moving to the specified device.
- Return type:
OpenClip
- tokenize(txt, context_length=None)[source]¶
Tokenize a text string and move to the correct device.
Converts input text into tokenized format suitable for the model, automatically moving the result to the model’s device.
- Parameters:
txt (str) – Input text string to tokenize.
context_length (int, optional) – Maximum context length for tokenization, by default None.
- Returns:
Tokenized text tensor ready for model input, on the correct device.
- Return type:
torch.Tensor
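Putting the methods above together, a sketch of a typical OpenClip workflow might look as follows. The model identifier reuses the SigLipV2 URL documented below, the image path and prompts are placeholders, and passing a list of prompts to tokenize assumes the underlying OpenCLIP tokenizer accepts a list of strings.

```python
import torch
from PIL import Image

from semanticlens.foundation_models.clip import OpenClip

# Model identifier reused from SigLipV2.URL; any OpenCLIP-compatible url should work.
fm = OpenClip(url="hf-hub:timm/ViT-B-16-SigLIP2", device="cpu")

image = Image.open("cat.png")                          # placeholder path
prompts = ["a photo of a cat", "a photo of a dog"]     # illustrative prompts

with torch.no_grad():
    image_emb = fm.encode_image(fm.preprocess(image))  # shape (1, d)
    text_emb = fm.encode_text(fm.tokenize(prompts))    # shape (len(prompts), d)

    # Compare image and text in the shared embedding space via cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = image_emb @ text_emb.T                # shape (1, len(prompts))
```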
- class semanticlens.foundation_models.clip.SigLipV2(device='cpu', **kwargs)[source]¶
Bases: OpenClip
SigLIP V2 vision-language model implementation.
A specialized OpenCLIP implementation using the SigLIP V2 model architecture optimized for improved vision-language understanding. A usage sketch follows the URL attribute below.
- Parameters:
device (str, optional) – The device to load the model on, by default “cpu”.
- URL = 'hf-hub:timm/ViT-B-16-SigLIP2'¶
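A brief usage sketch for SigLipV2; the model URL is fixed by the class, so only the device needs to be chosen. The CUDA availability check is plain PyTorch, not part of this API, and the prompt is illustrative.

```python
import torch

from semanticlens.foundation_models.clip import SigLipV2

fm = SigLipV2(device="cuda:0" if torch.cuda.is_available() else "cpu")

with torch.no_grad():
    text_features = fm.encode_text(fm.tokenize("a diagram of a neural network"))
```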
Module contents¶
Foundation model implementations for semantic analysis.
This module provides implementations of vision-language foundation models, currently supporting various CLIP model variants from different sources.
Classes¶
- class semanticlens.foundation_models.ClipMobile(version='s1', device='cpu', **kwargs)[source]¶
Bases: OpenClip
MobileCLIP vision-language model implementation.
A specialized OpenCLIP implementation using MobileCLIP models optimized for mobile deployment with efficient inference while maintaining performance.
- Parameters:
version (str, optional) – MobileCLIP variant to load, one of the keys of URLs ('s1' or 's2'), by default 's1'.
device (str, optional) – The device to load the model on, by default 'cpu'.
**kwargs – Additional keyword arguments for model creation.
- URLs = {'s1': 'MobileCLIP-S1', 's2': 'MobileCLIP-S2'}¶
- class semanticlens.foundation_models.OpenClip(url, device='cpu', **kwargs)[source]¶
Bases: AbstractVLM
OpenCLIP vision-language model implementation.
This class provides a concrete implementation of the AbstractVLM abstract base class using OpenCLIP models. It supports encoding images and text into a shared embedding space for various vision-language tasks.
- Parameters:
url (str) – Identifier of the OpenCLIP model to load (e.g., an 'hf-hub:' model reference).
device (str, optional) – The device to load the model on, by default 'cpu'.
**kwargs – Additional keyword arguments for model creation.
- model¶
The loaded OpenCLIP model.
- Type:
torch.nn.Module
- preprocessor¶
Image preprocessing function.
- Type:
callable
- tokenizer¶
Text tokenization function.
- Type:
callable
- __repr__()[source]¶
Return a string representation of the model instance.
- Returns:
String representation including the model version and device.
- Return type:
str
- property device¶
Get the device on which the model is located.
- Returns:
The device (CPU/GPU) on which the model parameters are located.
- Return type:
torch.device
- encode_image(img)[source]¶
Encode an image tensor into features.
- Parameters:
img (torch.Tensor) – Input image tensor.
- Returns:
Encoded image features.
- Return type:
torch.Tensor
- encode_text(text_input)[source]¶
Encode a text tensor into features.
- Parameters:
text_input (torch.Tensor) – Input text tensor.
- Returns:
Encoded text features.
- Return type:
torch.Tensor
- preprocess(img)[source]¶
Apply foundation model image preprocessing.
Preprocesses images for model consumption, handling both single images and lists of images. Also handles tensor dimension expansion and device placement automatically.
- Parameters:
img (Image.Image or list[Image.Image]) – Input image(s) to preprocess. Can be a single PIL Image or a list of PIL Images.
- Returns:
Preprocessed image tensor(s) ready for model input, moved to the correct device with proper batch dimensions.
- Return type:
torch.Tensor
- to(device)[source]¶
Move the model to the specified device.
- Parameters:
device (str or torch.device) – The target device to move the model to (e.g., ‘cpu’, ‘cuda:0’).
- Returns:
The model instance after moving to the specified device.
- Return type:
OpenClip
- tokenize(txt, context_length=None)[source]¶
Tokenize a text string and move to the correct device.
Converts input text into tokenized format suitable for the model, automatically moving the result to the model’s device.
- Parameters:
txt (str) – Input text string to tokenize.
context_length (int, optional) – Maximum context length for tokenization, by default None.
- Returns:
Tokenized text tensor ready for model input, on the correct device.
- Return type:
torch.Tensor
- class semanticlens.foundation_models.SigLipV2(device='cpu', **kwargs)[source]¶
Bases: OpenClip
SigLIP V2 vision-language model implementation.
A specialized OpenCLIP implementation using the SigLIP V2 model architecture optimized for improved vision-language understanding.
- Parameters:
device (str, optional) – The device to load the model on, by default “cpu”.
- URL = 'hf-hub:timm/ViT-B-16-SigLIP2'¶
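Because the classes above are re-exported at the package level, they can be imported without the .clip submodule path; a minimal sketch:

```python
from semanticlens.foundation_models import ClipMobile, OpenClip, SigLipV2

fm = SigLipV2()        # defaults to device="cpu"
fm = fm.to("cpu")      # to() returns the model instance, so calls can be chained
```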