Data for Fine-Tuning
What Columns do I Need for the Fine-Tuning Dataset?
At minimum, the dataset needs 1 text field and 1 image field. An outcome (or score) field, such as purchases, clicks, downloads, or popularity, is optional in the minimal case. For example, the text field could hold queries, the image field could hold product images, and the outcome field could record add-to-cart events. Other text fields could include things like the title. Ideally, both a query field and a title field are present, along with an image field and at least one high-intent outcome field such as add-to-cart, download, or purchase. These requirements are summarized in the table below.
Dataset | Text fields | Image fields | Outcome/score field |
---|---|---|---|
Minimal | 1 | 1 | 0 |
Ideal | 2+ | 1+ | 1+ |
What Does the Fine-Tuning Dataset Look Like?
Ideally, the dataset for training (fine-tuning) will have 2 text fields (queries and titles), 1 image field, and 1 outcome field (add-to-cart, purchase, or download are preferred). However, it is possible to use as few as 1 text field and 1 image field. Example datasets are shown below.
Minimal
An example of a minimal dataset.
Title | Image |
---|---|
“A funny scene” | funny_scene.jpg |
“A business center” | biz_cent.jpg |
… | … |
“A beach front” | beach.jpg |
Ideal
An example of an ideal dataset. The outcome is the cumulative total of the metric that we most care about, for example downloads or add-to-cart.
Query | Title | Image | Outcome |
---|---|---|---|
“funny” | “A funny scene” | funny_scene.jpg | 1 |
“business” | “A business center” | biz_cent.jpg | 2 |
… | … | … | … |
“beach holiday” | “A beach front” | beach.jpg | 1 |
Note: The names of the columns do not have to be exactly as they appear above.
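As a rough illustration, the requirements table earlier in this section can be expressed as a small check. This is a minimal sketch, not part of any fine-tuning API: the function name `classify_dataset` and the `"insufficient"` label are hypothetical, and the column names are placeholders.

```python
def classify_dataset(text_fields, image_fields, outcome_fields):
    """Classify a column layout as 'ideal', 'minimal', or 'insufficient'.

    Thresholds follow the requirements table: minimal needs 1 text field
    and 1 image field; ideal needs 2+ text, 1+ image, and 1+ outcome fields.
    The 'insufficient' label is an assumption added for this sketch.
    """
    n_text = len(text_fields)
    n_image = len(image_fields)
    n_outcome = len(outcome_fields)
    if n_text >= 2 and n_image >= 1 and n_outcome >= 1:
        return "ideal"
    if n_text >= 1 and n_image >= 1:
        return "minimal"
    return "insufficient"


# Hypothetical column names, mirroring the example tables above.
print(classify_dataset(["title"], ["image"], []))
print(classify_dataset(["query", "title"], ["image"], ["add_to_cart"]))
```

A layout with a query, a title, an image, and no outcome column still counts as minimal here, since the ideal row of the table requires at least one outcome field.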
What Columns do I Need for an Evaluation Dataset?
For an evaluation dataset, you need columns that can help in assessing the performance of your model. Typically, this involves having a query column and the expected result columns. The dataset should include:
Dataset | Text fields | Image fields | Score | Query Column | Result Columns |
---|---|---|---|---|---|
Evaluation | 1+ | 1+ | Optional | 1 | 1+ |
- Text fields: These can include titles, descriptions, or other relevant text data.
- Image fields: Fields that contain the image data or pointers to the image data.
- Score/Outcome field: Optional, but useful for certain types of evaluations.
- Query Column: The main query text used to retrieve results.
- Result Columns: The columns that are expected to be returned as results for the query.
What Does the Evaluation Dataset Look Like?
An example of an evaluation dataset.
Query | Title | Image | Score | Result_Column |
---|---|---|---|---|
“funny” | “A funny scene” | funny_scene.jpg | 1 | Image |
“business” | “A business center” | biz_cent.jpg | 2 | Title |
… | … | … | … | … |
“beach holiday” | “A beach front” | beach.jpg | 1 | Score |
Note: The names of the columns do not have to be exactly as they appear above.
How Much Data do I Need for Fine-Tuning?
Minimums vary, but roughly 100k rows of data in the format described above is a practical minimum, and millions of rows are ideal.
How do we Create the Dataset?
Historic search logs paired with product metadata are the best source for the dataset. For example, if downloads are the target outcome, you will need to gather the cumulative downloads for each query-product pair. Product metadata such as the title or description can then be added to each row according to the product that appears in it.
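The aggregation described above might look roughly like the following. This is a sketch under stated assumptions: the log format (dicts with `query`, `product_id`, and `event` keys), the metadata format, and the function name `build_finetuning_rows` are all hypothetical, and real search logs will need their own parsing.

```python
from collections import Counter


def build_finetuning_rows(search_log, product_metadata, outcome_event="download"):
    """Aggregate log events into one row per query-product pair.

    search_log: iterable of dicts with 'query', 'product_id', 'event' keys
        (an assumed format for this sketch).
    product_metadata: dict mapping product_id to a dict with 'title' and 'image'.
    outcome_event: the event type counted as the outcome.
    """
    # Cumulative count of the outcome event for each (query, product) pair.
    counts = Counter(
        (entry["query"], entry["product_id"])
        for entry in search_log
        if entry["event"] == outcome_event
    )
    rows = []
    for (query, product_id), total in sorted(counts.items()):
        meta = product_metadata.get(product_id)
        if meta is None:
            continue  # Skip products with no metadata rather than guessing.
        rows.append(
            {
                "query": query,
                "title": meta["title"],
                "image": meta["image"],
                "outcome": total,
            }
        )
    return rows


# Toy data reusing the example values from the tables above.
log = [
    {"query": "funny", "product_id": "p1", "event": "download"},
    {"query": "business", "product_id": "p2", "event": "download"},
    {"query": "business", "product_id": "p2", "event": "download"},
    {"query": "business", "product_id": "p2", "event": "click"},
]
metadata = {
    "p1": {"title": "A funny scene", "image": "funny_scene.jpg"},
    "p2": {"title": "A business center", "image": "biz_cent.jpg"},
}
for row in build_finetuning_rows(log, metadata):
    print(row)
```

Only the chosen outcome event is counted, so the click in the toy log does not contribute to the outcome for the "business" row.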
Does the Age of the Data Matter?
Using the most recent data is ideal. The right time frame depends on factors such as seasonality and how strongly these temporal changes affect users' search patterns. It also depends on how much data is available. If there is plenty of data, using only more recent data or sub-sampling older data can work. If there is less data, reaching further back in time is a good strategy.
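One way to sub-sample older data is to keep every recent row and a random fraction of older rows. The sketch below assumes each row carries a `date` field; the 90-day window and the 25% keep rate are arbitrary placeholder values, not recommendations, and the function name `subsample_by_age` is hypothetical.

```python
import random
from datetime import date, timedelta


def subsample_by_age(rows, today, recent_days=90, old_keep_fraction=0.25, seed=0):
    """Keep all rows newer than the cutoff and a random share of older rows.

    rows: iterable of dicts with a 'date' key holding a datetime.date
        (an assumed format for this sketch).
    recent_days, old_keep_fraction: placeholder defaults; tune them to the
        seasonality and volume of your own data.
    """
    rng = random.Random(seed)  # Seeded so the sample is reproducible.
    cutoff = today - timedelta(days=recent_days)
    kept = []
    for row in rows:
        if row["date"] >= cutoff or rng.random() < old_keep_fraction:
            kept.append(row)
    return kept


# Toy rows: one per day going back 200 days from a fixed reference date.
reference = date(2024, 6, 30)
toy_rows = [{"id": i, "date": reference - timedelta(days=i)} for i in range(200)]
sample = subsample_by_age(toy_rows, today=reference)
print(len(sample), "of", len(toy_rows), "rows kept")
```

When data is scarce, setting `old_keep_fraction` to 1.0 keeps everything, which matches the advice to reach further back in time.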