Documents
Add or replace documents
POST /indexes/{index_name}/documents
Add an array of documents or replace them if they already exist. If the provided index does not exist, it will be created.
If you send a document with an _id that corresponds to an existing document, the new document will overwrite the existing document.
This endpoint accepts the application/json content type.
Path parameters
Name | Type | Description |
---|---|---|
index_name | String | Name of the index |
Query parameters
Query Parameter | Type | Default Value | Description |
---|---|---|---|
refresh | Boolean | true | Forces a refresh after adding documents, which makes the documents available for searching. If you are happy to wait for the system to refresh on its own, set this to false for better performance. |
device | String | null | The device used to index the documents. This allows you to use CUDA GPUs to speed up indexing, if available. Defaults to the default device set on Marqo. Options include cpu, cuda, cuda1, cuda2, etc. The cuda option tells Marqo to use all available CUDA devices. |
non_tensor_fields | Array of Strings | [] | The fields within these documents to not create tensors for. Tensor search cannot be performed on these fields in these documents; pre-filtering and lexical search still work. |
use_existing_tensors | Boolean | false | If true, Marqo reuses existing tensors for unchanged fields in documents that are indexed with an ID. Note: Marqo compares the field's string value to detect updates, so it cannot detect a change if the same URL now points to a different image. |
image_download_headers | Dict | {} | A JSON-serialised, URL-encoded dictionary of headers for image download. Can be used to authenticate image downloads. |
mappings | Dict | null | A JSON-serialised, URL-encoded dictionary to handle object fields in documents. Check mappings for more information. Mappings are required to create multimodal tensor combination fields. |
model_auth | Dict | null | A JSON-serialised, URL-encoded dictionary containing the authorisation details Marqo uses to download non-publicly available models. See Model Auth for more information. |
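The dictionary-valued parameters above (image_download_headers, mappings, model_auth) have to be JSON-serialised and then URL-encoded before they go into the query string. The following is a minimal sketch of that encoding using only the Python standard library. The helper name build_add_documents_url, the base URL, and the header value are illustrative assumptions and are not part of Marqo.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_add_documents_url(base_url, index_name, **params):
    """Build the add-documents URL (hypothetical helper, not a Marqo API).

    Dict values are JSON-serialised, booleans become "true"/"false",
    and lists are sent as repeated query parameters.
    """
    query = []
    for name, value in params.items():
        if isinstance(value, dict):
            query.append((name, json.dumps(value)))
        elif isinstance(value, bool):
            query.append((name, "true" if value else "false"))
        elif isinstance(value, (list, tuple)):
            query.extend((name, item) for item in value)
        else:
            query.append((name, value))
    # urlencode percent-encodes every value, including the JSON strings.
    return f"{base_url}/indexes/{index_name}/documents?{urlencode(query)}"


url = build_add_documents_url(
    "http://localhost:8882",
    "my-first-index",
    refresh=False,
    non_tensor_fields=["Title", "Genre"],
    image_download_headers={"Authorization": "Bearer <SOME TOKEN>"},  # placeholder value
)
print(url)

# Decoding the query string recovers the original values.
decoded = parse_qs(urlsplit(url).query)
```

Sending list values as repeated parameters mirrors the non_tensor_fields=Title&non_tensor_fields=Genre form used in the cURL example on this page.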
Body
An array of documents. Each document is represented as a JSON object.
You can optionally set a document's ID with the special _id field. The _id must be a string. If an ID is not specified, Marqo will generate one.
[
    {
        "Title": "The Travels of Marco Polo",
        "Description": "A 13th-century travelogue describing Polo's travels"
    },
    {
        "Title": "Extravehicular Mobility Unit (EMU)",
        "Description": "The EMU is a spacesuit that provides environmental protection",
        "_id": "article_591"
    }
]
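Because the _id must be a string, it can be useful to check a batch on the client side before sending it. The snippet below is only a sketch of such a check; find_invalid_ids is a hypothetical helper name, and the third document is invented to show a failing case.

```python
def find_invalid_ids(documents):
    """Return the positions of documents whose _id is set but is not a string."""
    return [
        position
        for position, doc in enumerate(documents)
        if "_id" in doc and not isinstance(doc["_id"], str)
    ]


documents = [
    {"Title": "The Travels of Marco Polo"},  # no _id: Marqo generates one
    {"Title": "Extravehicular Mobility Unit (EMU)", "_id": "article_591"},
    {"Title": "A document with a numeric ID", "_id": 591},  # invalid: _id is an int
]
invalid_positions = find_invalid_ids(documents)
print(invalid_positions)
```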
Example
curl -XPOST 'http://localhost:8882/indexes/my-first-index/documents?non_tensor_fields=Title&non_tensor_fields=Genre' \
-H 'Content-type:application/json' -d '
[
    {
        "Title": "The Travels of Marco Polo",
        "Description": "A 13th-century travelogue describing the travels of Polo",
        "Genre": "History"
    },
    {
        "Title": "Extravehicular Mobility Unit (EMU)",
        "Description": "The EMU is a spacesuit that provides environmental protection",
        "_id": "article_591",
        "Genre": "Science"
    }
]'
import marqo

# Connect to the Marqo instance used throughout these examples.
mq = marqo.Client(url="http://localhost:8882")

mq.index("my-first-index").add_documents(
    [
        {
            "Title": "The Travels of Marco Polo",
            "Description": "A 13th-century travelogue describing the travels of Polo",
            "Genre": "History"
        },
        {
            "Title": "Extravehicular Mobility Unit (EMU)",
            "Description": "The EMU is a spacesuit that provides environmental protection",
            "_id": "article_591",
            "Genre": "Science"
        }
    ],
    non_tensor_fields=["Title", "Genre"]
)
mq.addDocuments([{
    Title: "The Travels of Marco Polo",
    Description: "A 13th-century travelogue describing the travels of Polo"
}, {
    Title: "Extravehicular Mobility Unit (EMU)",
    Description: "The EMU is a spacesuit that provides environmental protection",
    _id: "article_591"
}],
    "my-first-index"
)
Response: 200 OK
{
    "errors": false,
    "items": [
        {
            "_id": "5aed93eb-3878-4f12-bc92-0fda01c7d23d",
            "result": "created",
            "status": 201
        },
        {
            "_id": "article_591",
            "result": "updated",
            "status": 200
        }
    ],
    "processingTimeMs": 6,
    "index_name": "my-first-index"
}
The first document in this example had its _id generated by Marqo.
There was already a document in Marqo with _id = article_591, so it was updated rather than created.
In both the cURL and Python examples, the fields Title and Genre do not have tensors for these documents, so they cannot be searched with tensor search. The JS client does not currently support non_tensor_fields.
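To act on the created and updated outcomes described above, a client can group the returned IDs by their result value. This is a minimal sketch that works on the sample response from this page, written as a Python dict; group_ids_by_result is a hypothetical helper name, and the sketch assumes every item carries a result key as the sample items do.

```python
# The sample add-documents response shown above, as a Python dict.
response = {
    "errors": False,
    "items": [
        {"_id": "5aed93eb-3878-4f12-bc92-0fda01c7d23d", "result": "created", "status": 201},
        {"_id": "article_591", "result": "updated", "status": 200},
    ],
    "processingTimeMs": 6,
    "index_name": "my-first-index",
}


def group_ids_by_result(add_response):
    """Map each result value (e.g. "created", "updated") to the IDs that got it."""
    grouped = {}
    for item in add_response["items"]:
        grouped.setdefault(item["result"], []).append(item["_id"])
    return grouped


grouped = group_ids_by_result(response)
print(grouped)
```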
Get one document
GET /indexes/{index_name}/documents/{document_id}
Path parameters
Name | Type | Description |
---|---|---|
index_name | String | Name of the index |
document_id | String | ID of the document |
Query parameters
Query parameter | Type | Default value | Description |
---|---|---|---|
expose_facets | Boolean | false | If true, the document's tensor facets are returned. This is a list of objects. Each facet object contains document data and its associated embedding (found in the facet's _embedding field). |
Example
curl -XGET 'http://localhost:8882/indexes/my-first-index/documents/article_152?expose_facets=true'
mq.index("my-first-index").get_document(
    document_id="article_152",
    expose_facets=True
)
Response: 200 OK
{'Blurb': 'A rocket car is a car powered by a rocket engine. This treatise '
          'proposes that rocket cars are the inevitable future of land-based '
          'transport.',
 'Title': 'Treatise on the viability of rocket cars',
 '_id': 'article_152',
 '_tensor_facets': [{'Title': 'Treatise on the viability of rocket cars',
                     '_embedding': [-0.10393160581588745,
                                    0.0465407557785511,
                                    -0.01760256476700306,
                                    ...]},
                    {'Blurb': 'A rocket car is a car powered by a rocket '
                              'engine. This treatise proposes that rocket cars '
                              'are the inevitable future of land-based '
                              'transport.',
                     '_embedding': [-0.045681700110435486,
                                    0.056278493255376816,
                                    0.022254955023527145,
                                    ...]}]
}
In this example, the GET document request was sent with the expose_facets parameter set to true.
The _tensor_facets field is returned as a result. Within each facet, there is a key-value pair that holds the content of the facet, and an _embedding field, which is the content's vector representation.
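The facet layout described above can be unpacked with a short loop. The sketch below assumes the structure shown in the sample response (one content key plus _embedding per facet); facet_triples is a hypothetical helper name, and the embeddings are shortened, made-up three-element vectors used only to keep the example small.

```python
# A trimmed-down document in the shape returned with expose_facets=true.
# The embedding values are illustrative placeholders, not real model output.
document = {
    "Title": "Treatise on the viability of rocket cars",
    "_id": "article_152",
    "_tensor_facets": [
        {"Title": "Treatise on the viability of rocket cars",
         "_embedding": [-0.104, 0.047, -0.018]},
        {"Blurb": "A rocket car is a car powered by a rocket engine.",
         "_embedding": [-0.046, 0.056, 0.022]},
    ],
}


def facet_triples(doc):
    """Return (field, content, embedding) triples, one per tensor facet."""
    triples = []
    for facet in doc.get("_tensor_facets", []):
        embedding = facet["_embedding"]
        for field, content in facet.items():
            if field != "_embedding":
                triples.append((field, content, embedding))
    return triples


for field, content, embedding in facet_triples(document):
    print(field, len(embedding), content)
```

Triples are returned rather than a field-to-vector dict so that several facets belonging to the same field would all be kept.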
Get multiple documents
GET /indexes/{index_name}/documents
This endpoint accepts the application/json content type.
Path parameters
Name | Type | Description |
---|---|---|
index_name | String | Name of the index |
Query parameters
Query parameter | Type | Default value | Description |
---|---|---|---|
expose_facets | Boolean | false | If true, the documents' tensor facets are returned. This is a list of objects. Each facet object contains document data and its associated embedding (found in the facet's _embedding field). |
Body
An array of IDs. Each ID is a string.
["article_152", "article_490", "article_985"]
Example
curl -XGET http://localhost:8882/indexes/my-first-index/documents -H 'Content-Type: application/json' -d '
["article_152", "article_490", "article_985"]
'
mq.index("my-first-index").get_documents(
    document_ids=["article_152", "article_490", "article_985"]
)
Response: 200 OK
{'results': [{'Blurb': 'A rocket car is a car powered by a rocket engine. This '
                       'treatise proposes that rocket cars are the inevitable '
                       'future of land-based transport.',
              'Title': 'Treatise on the viability of rocket cars',
              '_found': True,
              '_id': 'article_152'},
             {'_found': False, '_id': 'article_490'},
             {'Blurb': "One must maintain one's space suit. It is, after all, "
                       'the tool that will help you explore distant galaxies.',
              'Title': 'Your space suit and you',
              '_found': True,
              '_id': 'article_985'}]}
In this example, no document was found for the ID article_490. As a result, its _found field is false.
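Since missing documents come back as stubs rather than as an error, callers usually need to separate the two cases themselves. A minimal sketch, assuming the response shape shown above; split_found is a hypothetical helper name and the documents are abbreviated.

```python
# Abbreviated version of the sample response above, as a Python dict.
response = {
    "results": [
        {"Title": "Treatise on the viability of rocket cars", "_found": True, "_id": "article_152"},
        {"_found": False, "_id": "article_490"},
        {"Title": "Your space suit and you", "_found": True, "_id": "article_985"},
    ]
}


def split_found(get_response):
    """Return (documents that were found, IDs that were not found)."""
    found = [doc for doc in get_response["results"] if doc["_found"]]
    missing = [doc["_id"] for doc in get_response["results"] if not doc["_found"]]
    return found, missing


found, missing = split_found(response)
print([doc["_id"] for doc in found], missing)
```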
Delete documents
Delete documents identified by an array of their IDs.
POST /indexes/{index_name}/documents/delete-batch
Path parameters
Name | Type | Description |
---|---|---|
index_name | String | Name of the index |
Body
An array of document IDs, to be deleted.
[ "article_591", "article_602" ]
Example
curl -XPOST http://localhost:8882/indexes/my-first-index/documents/delete-batch -H 'Content-type:application/json' -d '[
"article_591", "article_602"
]'
mq.index("my-first-index").delete_documents(ids=["article_591", "article_602"])
Response: 200 OK
{
    "index_name": "my-first-index",
    "status": "succeeded",
    "type": "documentDeletion",
    "details": {
        "receivedDocumentIds": 2,
        "deletedDocuments": 1
    },
    "duration": "PT0.084367S",
    "startedAt": "2022-09-01T05:11:31.790986Z",
    "finishedAt": "2022-09-01T05:11:31.875353Z"
}
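In the sample response, two IDs were received but only one document was deleted, so a status of succeeded does not by itself mean every ID was removed. The sketch below shows one way a client might surface that gap and recompute the elapsed time from the two timestamps. summarise_deletion is a hypothetical helper name, and the timestamp format is assumed from the sample values.

```python
from datetime import datetime

# The sample delete-batch response shown above, as a Python dict.
response = {
    "index_name": "my-first-index",
    "status": "succeeded",
    "type": "documentDeletion",
    "details": {"receivedDocumentIds": 2, "deletedDocuments": 1},
    "duration": "PT0.084367S",
    "startedAt": "2022-09-01T05:11:31.790986Z",
    "finishedAt": "2022-09-01T05:11:31.875353Z",
}

# Assumed timestamp layout, based on the sample values above.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def summarise_deletion(delete_response):
    """Return (number of received IDs not deleted, elapsed seconds)."""
    details = delete_response["details"]
    not_deleted = details["receivedDocumentIds"] - details["deletedDocuments"]
    started = datetime.strptime(delete_response["startedAt"], TIMESTAMP_FORMAT)
    finished = datetime.strptime(delete_response["finishedAt"], TIMESTAMP_FORMAT)
    return not_deleted, (finished - started).total_seconds()


not_deleted, elapsed_seconds = summarise_deletion(response)
print(not_deleted, elapsed_seconds)
```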
Model Auth
Parameter: model_auth
Expected value: URL-encoded JSON object with either an s3 or an hf model store authorisation object.
Default value: null
The model_auth object allows adding documents to indexes that use OpenCLIP and CLIP models from private Hugging Face and AWS S3 stores.
The model_auth object contains either an s3 or an hf model store authorisation object. The model store authorisation object contains the credentials needed to access the index's non-publicly accessible model. See the examples for details.
The index's settings must specify the non-publicly accessible model's location in the settings' model_properties object.
model_auth is used to initially download the model. After downloading, Marqo caches the model so that it doesn't need to be redownloaded.
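When calling the REST endpoint directly rather than through the Python client, the model_auth object has to be turned into a URL-encoded JSON string. The following is a sketch of that step using the standard library. encode_model_auth is a hypothetical helper name, and the strict check for exactly one of s3 or hf is this sketch's reading of "either an s3 or an hf" above, not documented Marqo validation.

```python
import json
from urllib.parse import quote, unquote


def encode_model_auth(model_auth):
    """Return model_auth as a URL-encoded JSON string (illustrative helper)."""
    # Assumption: exactly one model store ("s3" or "hf") may be present.
    if len(model_auth) != 1 or not set(model_auth) <= {"s3", "hf"}:
        raise ValueError("model_auth must contain exactly one of 's3' or 'hf'")
    # safe="" percent-encodes every reserved character, including "/" and ":".
    return quote(json.dumps(model_auth), safe="")


encoded = encode_model_auth({"hf": {"token": "<SOME HF TOKEN>"}})
print("model_auth=" + encoded)

# Round trip: decoding gives back the original object.
decoded = json.loads(unquote(encoded))
```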
Example: AWS S3
# Create an index that specifies the non-public location of the model.
# Note the `auth_required` field in `model_properties`, which tells Marqo to use
# the `model_auth` it finds during add_documents to download the model.
mq.create_index(
    index_name="my-cool-index",
    settings_dict={
        "index_defaults": {
            "treat_urls_and_pointers_as_images": True,
            "model": 'my_s3_model',
            "normalize_embeddings": True,
            "model_properties": {
                "name": "ViT-B/32",
                "dimensions": 512,
                "model_location": {
                    "s3": {
                        "Bucket": "<SOME BUCKET>",
                        "Key": "<KEY TO IDENTIFY MODEL>",
                    },
                    "auth_required": True
                },
                "type": "open_clip",
            }
        }
    }
)
# Specify the authorisation needed to access the private model during add_documents.
# We recommend setting up the credential's AWS user so that it has only the minimal
# access needed to retrieve the model.
mq.index("my-cool-index").add_documents(
    auto_refresh=True,
    documents=[
        {'Title': 'The coolest moon walks'}
    ],
    model_auth={
        's3': {
            "aws_access_key_id": "<SOME ACCESS KEY ID>",
            "aws_secret_access_key": "<SOME SECRET ACCESS KEY>"
        }
    }
)
Example: Hugging Face (HF)
# Create an index that specifies the non-public location of the model.
# Note the `auth_required` field in `model_properties`, which tells Marqo to use
# the `model_auth` it finds during add_documents to download the model.
mq.create_index(
    index_name="my-cool-index",
    settings_dict={
        "index_defaults": {
            "treat_urls_and_pointers_as_images": True,
            "model": 'my_hf_model',
            "normalize_embeddings": True,
            "model_properties": {
                "name": "ViT-B/32",
                "dimensions": 512,
                "model_location": {
                    "hf": {
                        "repo_id": "<SOME HF REPO NAME>",
                        "filename": "<THE FILENAME TO DOWNLOAD>",
                    },
                    "auth_required": True
                },
                "type": "open_clip",
            }
        }
    }
)
# Specify the authorisation needed to access the private model during add_documents.
mq.index("my-cool-index").add_documents(
    auto_refresh=True,
    documents=[
        {'Title': 'The coolest moon walks'}
    ],
    model_auth={
        'hf': {
            "token": "<SOME HF TOKEN>",
        }
    }
)