Dense retrieval models

Dense retrieval models take an input such as text or images and return a fixed-size array (an embedding). This representation is then indexed and made searchable using approximate nearest neighbour algorithms together with a similarity measure such as cosine similarity or L2 distance.
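For intuition, the two similarity measures mentioned above can be sketched in a few lines of plain Python. The helper names and the 3-dimensional vectors below are purely illustrative; real embeddings typically have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    # Euclidean (L2) distance between the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional "embeddings"; real models output e.g. 384 or 768 dimensions.
query_vec = [0.1, 0.9, 0.2]
doc_vec = [0.2, 0.8, 0.1]
print(cosine_similarity(query_vec, doc_vec))
print(l2_distance(query_vec, doc_vec))
```

Note that when embeddings are normalized to unit length, ranking by cosine similarity and ranking by L2 distance produce the same ordering.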


The following models are supported by default (and are primarily based on the excellent sbert and Hugging Face libraries and models).

These models can be selected when creating the index and are illustrated by the example below:

# Import Marqo and create a client
import marqo
mq = marqo.Client(url="http://localhost:8882")

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": False,
        "model": "flax-sentence-embeddings/all_datasets_v4_MiniLM-L6",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)

The model field is the pertinent field for selecting the model to use. Note that once an index has been created and a model has been selected, the model cannot be changed; a new index would need to be created with the alternative model. The model is applied to all relevant fields. Field-specific settings, which would allow different models to be applied to different fields, are not currently supported but are coming soon (and contributions are always welcome).

Although use case specific, a good starting point is the model flax-sentence-embeddings/all_datasets_v4_MiniLM-L6. It provides a good compromise between speed and relevancy. The model flax-sentence-embeddings/all_datasets_v4_mpnet-base provides the best relevancy (in general).


ONNX versions of the above models can also be used. ONNX is an open format for models that is designed to allow interoperability of models across frameworks. Other benefits include faster inference (model and use case specific, but ~2x) and lower memory usage. The ONNX conversion of the above models happens on the fly. To use one of the above models as an ONNX version, simply replace the text preceding the first '/' with 'onnx'. For example:

  • onnx/all-MiniLM-L6-v1
  • onnx/all-MiniLM-L6-v2
  • onnx/all_datasets_v3_MiniLM-L12
  • onnx/all_datasets_v3_MiniLM-L6
  • onnx/all_datasets_v4_MiniLM-L12
  • onnx/all_datasets_v4_MiniLM-L6
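The renaming rule can be expressed as a tiny helper (this function is purely illustrative and not part of the Marqo API):

```python
def to_onnx_name(model_name: str) -> str:
    # Replace the text preceding the first '/' with 'onnx'.
    _, _, suffix = model_name.partition("/")
    return "onnx/" + suffix

print(to_onnx_name("flax-sentence-embeddings/all_datasets_v4_MiniLM-L6"))
# onnx/all_datasets_v4_MiniLM-L6
```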

The 'mpnet'-based models are not currently supported by the ONNX conversion but will be added soon. See the example below of how to use an ONNX model:

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": False,
        "model": "onnx/all_datasets_v4_MiniLM-L6",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)


The models used for tensorizing images come from CLIP. We support two implementations: one from OpenAI, and an open source implementation called open_clip. The following OpenAI models are supported:


  • RN50
  • RN101
  • RN50x4
  • RN50x16
  • RN50x64
  • ViT-B/32
  • ViT-B/16
  • ViT-L/14
  • ViT-L/14@336px

Although use case specific, a good starting point is the model ViT-B/16. It provides a good compromise between speed and relevancy. The models ViT-L/14 and ViT-L/14@336px provide the best relevancy (in general) but are typically slower.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "ViT-L/14",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)


Some OpenAI CLIP models can be loaded in float16, but ONLY when a CUDA device is available. This can substantially increase speed with a minor loss in accuracy. In our tests, inference time was reduced by ~50%, although this is device dependent. Available models are:

  • fp16/ViT-L/14
  • fp16/ViT-B/32
  • fp16/ViT-B/16

You can load the model with:

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "fp16/ViT-L/14",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)


The following models from open_clip are supported:

  • open_clip/RN50/openai
  • open_clip/RN50/yfcc15m
  • open_clip/RN50/cc12m
  • open_clip/RN50-quickgelu/openai
  • open_clip/RN50-quickgelu/yfcc15m
  • open_clip/RN50-quickgelu/cc12m
  • open_clip/RN101/openai
  • open_clip/RN101/yfcc15m
  • open_clip/RN101-quickgelu/openai
  • open_clip/RN101-quickgelu/yfcc15m
  • open_clip/RN50x4/openai
  • open_clip/RN50x16/openai
  • open_clip/RN50x64/openai
  • open_clip/ViT-B-32/openai
  • open_clip/ViT-B-32/laion400m_e31
  • open_clip/ViT-B-32/laion400m_e32
  • open_clip/ViT-B-32/laion2b_e16
  • open_clip/ViT-B-32/laion2b_s34b_b79k
  • open_clip/ViT-B-32-quickgelu/openai
  • open_clip/ViT-B-32-quickgelu/laion400m_e31
  • open_clip/ViT-B-32-quickgelu/laion400m_e32
  • open_clip/ViT-B-16/openai
  • open_clip/ViT-B-16/laion400m_e31
  • open_clip/ViT-B-16/laion400m_e32
  • open_clip/ViT-B-16-plus-240/laion400m_e31
  • open_clip/ViT-B-16-plus-240/laion400m_e32
  • open_clip/ViT-L-14/openai
  • open_clip/ViT-L-14/laion400m_e31
  • open_clip/ViT-L-14/laion400m_e32
  • open_clip/ViT-L-14/laion2b_s32b_b82k
  • open_clip/ViT-L-14-336/openai
  • open_clip/ViT-H-14/laion2b_s32b_b79k
  • open_clip/ViT-g-14/laion2b_s12b_b42k
  • open_clip/ViT-g-14/laion2b_s34b_b88k
  • open_clip/ViT-bigG-14/laion2b_s39b_b160k
  • open_clip/roberta-ViT-B-32/laion2b_s12b_b32k
  • open_clip/xlm-roberta-base-ViT-B-32/laion5b_s13b_b90k
  • open_clip/xlm-roberta-large-ViT-H-14/frozen_laion5b_s13b_b90k
  • open_clip/convnext_base/laion400m_s13b_b51k
  • open_clip/convnext_base_w/laion2b_s13b_b82k
  • open_clip/convnext_base_w/laion2b_s13b_b82k_augreg
  • open_clip/convnext_base_w/laion_aesthetic_s13b_b82k
  • open_clip/convnext_base_w_320/laion_aesthetic_s13b_b82k
  • open_clip/convnext_base_w_320/laion_aesthetic_s13b_b82k_augreg
  • open_clip/convnext_large_d/laion2b_s26b_b102k_augreg
  • open_clip/convnext_large_d_320/laion2b_s29b_b131k_ft
  • open_clip/convnext_large_d_320/laion2b_s29b_b131k_ft_soup
  • open_clip/coca_ViT-B-32/laion2b_s13b_b90k
  • open_clip/coca_ViT-B-32/mscoco_finetuned_laion2b_s13b_b90k
  • open_clip/coca_ViT-L-14/laion2b_s13b_b90k
  • open_clip/coca_ViT-L-14/mscoco_finetuned_laion2b_s13b_b90k

Like the OpenAI-based models, the larger ViT-based models typically perform better. For example, open_clip/ViT-H-14/laion2b_s32b_b79k is the best model for relevancy (in general) and surpasses even the best models from OpenAI.

The names of the open_clip models follow the format "implementation source/model name/pretrained dataset". The detailed configurations of the models can be found here.
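The naming format can be unpacked programmatically; the helper below is purely illustrative and not part of Marqo:

```python
def parse_open_clip_name(model: str) -> dict:
    # Split "open_clip/<model name>/<pretrained dataset>" into its parts.
    source, name, pretrained = model.split("/")
    return {"source": source, "name": name, "pretrained": pretrained}

print(parse_open_clip_name("open_clip/ViT-H-14/laion2b_s32b_b79k"))
# {'source': 'open_clip', 'name': 'ViT-H-14', 'pretrained': 'laion2b_s32b_b79k'}
```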

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "open_clip/ViT-H-14/laion2b_s32b_b79k",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)


ONNX versions of CLIP models are available, in addition to the native PyTorch versions described above. Both the original OpenAI CLIP models and open_clip models are available. The ONNX models are named according to the following format:


ONNX_PRECISION/SOURCE/MODEL_NAME/PRETRAINED, where:

Parameter       Type    Options                      Description
ONNX_PRECISION  String  onnx16 or onnx32             Precision of the model.
SOURCE          String  openai or open_clip          The implementation of the model.
MODEL_NAME      String  e.g., ViT-L-14               The name (architecture) of the CLIP model.
PRETRAINED      String  e.g., openai, laion400m_e32  The pretrained dataset of the model; only applies to open_clip models.

  • onnx32/openai/ViT-L/14
  • onnx16/openai/ViT-L/14
  • onnx32/open_clip/ViT-L-14/laion400m_e32
  • onnx16/open_clip/ViT-L-14/laion400m_e32
  • onnx32/open_clip/ViT-L-14/laion2b_s32b_b82k
  • onnx16/open_clip/ViT-L-14/laion2b_s32b_b82k
  • onnx32/open_clip/ViT-L-14-336/openai
  • onnx16/open_clip/ViT-L-14-336/openai
  • onnx32/open_clip/ViT-B-32/openai
  • onnx16/open_clip/ViT-B-32/openai
  • onnx32/open_clip/ViT-B-32/laion400m_e31
  • onnx16/open_clip/ViT-B-32/laion400m_e31
  • onnx32/open_clip/ViT-B-32/laion400m_e32
  • onnx16/open_clip/ViT-B-32/laion400m_e32
  • onnx32/open_clip/ViT-B-32/laion2b_e16
  • onnx16/open_clip/ViT-B-32/laion2b_e16
  • onnx32/open_clip/ViT-B-32-quickgelu/openai
  • onnx16/open_clip/ViT-B-32-quickgelu/openai
  • onnx32/open_clip/ViT-B-32-quickgelu/laion400m_e31
  • onnx16/open_clip/ViT-B-32-quickgelu/laion400m_e31
  • onnx16/open_clip/ViT-B-32-quickgelu/laion400m_e32
  • onnx32/open_clip/ViT-B-32-quickgelu/laion400m_e32
  • onnx16/open_clip/ViT-B-16/openai
  • onnx32/open_clip/ViT-B-16/openai
  • onnx16/open_clip/ViT-B-16/laion400m_e31
  • onnx32/open_clip/ViT-B-16/laion400m_e31
  • onnx16/open_clip/ViT-B-16/laion400m_e32
  • onnx32/open_clip/ViT-B-16/laion400m_e32
  • onnx16/open_clip/ViT-B-16-plus-240/laion400m_e31
  • onnx32/open_clip/ViT-B-16-plus-240/laion400m_e31
  • onnx16/open_clip/ViT-B-16-plus-240/laion400m_e32
  • onnx32/open_clip/ViT-B-16-plus-240/laion400m_e32
  • onnx16/open_clip/ViT-H-14/laion2b_s32b_b79k
  • onnx32/open_clip/ViT-H-14/laion2b_s32b_b79k
  • onnx16/open_clip/ViT-g-14/laion2b_s12b_b42k
  • onnx32/open_clip/ViT-g-14/laion2b_s12b_b42k
  • onnx16/open_clip/RN50/openai
  • onnx32/open_clip/RN50/openai
  • onnx16/open_clip/RN50/yfcc15m
  • onnx32/open_clip/RN50/yfcc15m
  • onnx16/open_clip/RN50/cc12m
  • onnx32/open_clip/RN50/cc12m
  • onnx16/open_clip/RN50-quickgelu/openai
  • onnx32/open_clip/RN50-quickgelu/openai
  • onnx16/open_clip/RN50-quickgelu/yfcc15m
  • onnx32/open_clip/RN50-quickgelu/yfcc15m
  • onnx16/open_clip/RN50-quickgelu/cc12m
  • onnx32/open_clip/RN50-quickgelu/cc12m
  • onnx16/open_clip/RN101/openai
  • onnx32/open_clip/RN101/openai
  • onnx16/open_clip/RN101/yfcc15m
  • onnx32/open_clip/RN101/yfcc15m
  • onnx16/open_clip/RN101-quickgelu/openai
  • onnx32/open_clip/RN101-quickgelu/openai
  • onnx16/open_clip/RN101-quickgelu/yfcc15m
  • onnx32/open_clip/RN101-quickgelu/yfcc15m
  • onnx16/open_clip/RN50x4/openai
  • onnx32/open_clip/RN50x4/openai
  • onnx16/open_clip/RN50x16/openai
  • onnx32/open_clip/RN50x16/openai
  • onnx16/open_clip/RN50x64/openai
  • onnx32/open_clip/RN50x64/openai
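The naming scheme above can also be unpacked in code. Note that OpenAI architecture names themselves contain a '/' (e.g. ViT-L/14), and OpenAI models carry no separate pretrained-dataset field, so the parsing differs by source. This parser is purely illustrative, not part of Marqo:

```python
def parse_onnx_clip_name(model: str) -> dict:
    precision, source, rest = model.split("/", 2)
    if source == "openai":
        # OpenAI architecture names may themselves contain '/', e.g. ViT-L/14,
        # and there is no separate pretrained-dataset component.
        return {"precision": precision, "source": source,
                "name": rest, "pretrained": None}
    # open_clip names end with MODEL_NAME/PRETRAINED.
    name, pretrained = rest.rsplit("/", 1)
    return {"precision": precision, "source": source,
            "name": name, "pretrained": pretrained}

print(parse_onnx_clip_name("onnx32/openai/ViT-L/14"))
print(parse_onnx_clip_name("onnx16/open_clip/ViT-L-14/laion400m_e32"))
```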

The onnx32 variants should give approximately the same results as the native float32 PyTorch implementations but with lower latency. In our tests, they can reduce the indexing time per image by ~25% (although this ultimately depends on the exact model and hardware). We encourage you to use these models if you need to index a large number of images with the best accuracy.

The onnx16 variants are the float16 versions of the above models. They provide even faster inference, with a ~65% reduction in time compared to PyTorch and ~54% compared to the onnx32 variants (although this ultimately depends on the exact model and hardware). However, their search accuracy is not as good as the float32 versions. This is more pronounced when searching across modalities (e.g. text-to-image) than within the same modality (e.g. image-to-image). If you care mostly about indexing speed and are less sensitive to accuracy, these may be your choice. Quantitative evaluation of both relevancy and latency for your use case is encouraged to determine the best model.

To use these onnx CLIP models, simply specify them at index creation time:

# For an OpenAI model:
settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "onnx32/openai/ViT-L/14",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)

# For an open_clip model:
settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "onnx32/open_clip/ViT-L-14-336/openai",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)

Multilingual CLIP

Marqo supports multilingual CLIP models that are trained on more than 100 languages, provided by this project. You can use the following models to achieve multi-modal search in your preferred language:

  • multilingual-clip/XLM-Roberta-Large-Vit-L-14
  • multilingual-clip/XLM-R Large Vit-B/16+
  • multilingual-clip/XLM-Roberta-Large-Vit-B-32
  • multilingual-clip/LABSE-Vit-L-14

These models can be specified at index creation time. Note that multilingual CLIP models are very large (approximately 6GB), so a CUDA device is highly recommended.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "multilingual-clip/XLM-Roberta-Large-Vit-L-14",
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-index", settings_dict=settings)

Generic CLIP Models

You can use your own fine-tuned CLIP models with custom weights in Marqo. Depending on the framework you are using (we currently support the model frameworks from OpenAI CLIP and open_clip), you can set up the index as follows.

Open CLIP

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "generic-clip-test-model-1",
        "model_properties": {
            "name": "ViT-B-32-quickgelu",
            "dimensions": 512,
            "url": "",
            "type": "open_clip",
        },
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)

OpenAI CLIP

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": True,
        "model": "generic-clip-test-model-2",
        "model_properties": {
            "name": "ViT-B/32",
            "dimensions": 512,
            "url": "",
            "type": "clip",
        },
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)

It is very important to set "treat_urls_and_pointers_as_images": True to enable multi-modal search. The model field is required and acts as an identifying alias for the model specified through model_properties.

In model_properties, the name field identifies the model architecture, dimensions specifies the dimension of the output, and type specifies the framework you are using. You should also provide your custom model (checkpoint) via the url field; you will need to serve your model and make it accessible via a URL. For more detailed instructions, please check here.

Advanced usage: if Marqo is not running on Docker, models may be stored locally and referenced using a local file pointer. By default, Marqo running within Docker will not be able to access these.

Users should be conscious of the difference between the model and name fields. model acts as an identifying alias in Marqo (for generic models, you can choose your own). name, in this case, is used to identify the CLIP architecture from OpenAI or OpenCLIP.

A table of all the required fields is listed below.

Required Keys for model_properties

Field       Type                           Description
name        String                         Name of the model in the library. If the model is specified by model_properties.model_location, this parameter instead refers to the model architecture, for example ViT-L/14.
dimensions  Integer                        Dimensions of the model.
url         String                         The URL of the custom model.
type        String, "clip" or "open_clip"  The framework of the model.

Optional fields provide further flexibility for generic models. These fields only work for models from open_clip, as this framework provides more flexibility.

Optional Keys for model_properties

Field           Type        Default value                         Description
jit             Bool        False                                 Whether to load the model in JIT mode.
precision       String      "fp32"                                The precision of the model. Optional values: "fp32" or "fp16".
tokenizer       String      "clip"                                The name of the tokenizer. We support Hugging Face tokenizers.
mean            Tuple       (0.48145466, 0.4578275, 0.40821073)   The mean of the image for normalization.
std             Tuple       (0.26862954, 0.26130258, 0.27577711)  The standard deviation of the image for normalization.
model_location  Dictionary  ""                                    The location of the model if it is not easily reachable by URL (for example, a model hosted on a private Hugging Face or AWS S3 repo). See here for examples.

Generic SBERT Models

You can also use models that are not supported by default.

settings = {
    "index_defaults": {
        "treat_urls_and_pointers_as_images": False,
        "model": "unique-model-alias",
        "model_properties": {
            "name": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
            "dimensions": 384,
            "tokens": 128,
            "type": "hf",
        },
        "normalize_embeddings": True,
    },
}
response = mq.create_index("my-generic-model-index", settings_dict=settings)

The model field is required and acts as an identifying alias to the model specified through model_properties. If a default model name is used in the name field, model_properties will override the default model settings.

Currently, models hosted on the Hugging Face model hub are supported. These models need to output embeddings and conform to either the sbert API or the Hugging Face API. More options for custom models will be added shortly, including inference endpoints.
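Models conforming to the Hugging Face API output one embedding per token, so a single sentence embedding must be pooled from them. A common recipe (and the usual sbert approach) is attention-mask-weighted mean pooling, sketched here in plain Python for illustration only; the exact pooling Marqo applies may differ:

```python
def mean_pool(token_embeddings, attention_mask):
    # Average the token embeddings, counting only non-padding tokens
    # (attention_mask is 1 for real tokens, 0 for padding).
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i, v in enumerate(vec):
                totals[i] += v
    return [t / count for t in totals]

# Three "token embeddings" of dimension 2; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```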

Required Keys for model_properties

Field       Type     Description
name        String   Name of the model in the library. This is required unless model_properties.model_location is specified.
dimensions  Integer  Dimensions of the model.
type        String   Type of model loader. Must be set to "hf" for generic SBERT models.

Optional Keys for model_properties

Field           Type        Default value  Description
tokens          Integer     128            Number of tokens.
model_location  Dictionary  ""             The location of the model if it is not easily reachable by URL (for example, a model hosted on a private Hugging Face or AWS S3 repo). See here for examples.

Other media types

At the moment only text and images are supported. Other media types and custom media types will be supported soon.