Clarification on JSON Lines Dataset for Multi-Task Fine-Tuning of Florence-2 #323
-
| Hi @mariaalfaroc 👋 I don't have an answer, but I suggest looking at maestro. That's our newest project, aimed explicitly at fine-tuning multimodal models. Note that the next two weeks are intense, so the maintainers might not respond. Here's where @SkalskiP talks about the data format for Florence-2: YouTube. | 
-
| Hi @LinasKo, thanks for your response! I've reviewed the maestro documentation and the YouTube tutorial. However, both of them cover fine-tuning Florence-2 on a single task at a time: either Object Detection (OD) or Visual Question Answering (VQA). For OD, a sample annotation from the dataset looks like this:
{
  "image": "IMG_20220316_165139_jpg.rf.e4c229a9128494d17992cbe88af575df.jpg",
  "prefix": "<OD>",
  "suffix": "9 of diamonds<loc_141><loc_18><loc_404><loc_465>jack of diamonds<loc_589><loc_120><loc_789><loc_454>queen of diamonds<loc_308><loc_482><loc_570><loc_966>king of diamonds<loc_549><loc_477><loc_777><loc_904>10 of diamonds<loc_396><loc_75><loc_613><loc_458>"
}
For VQA, it appears as:
{
  "image": "IMG_20220316_165139_jpg.rf.e4c229a9128494d17992cbe88af575df.jpg",
  "prefix": "<VQA> How many cards are in the image?",
  "suffix": "5"
}
What I'd like to know is: how should the annotations be structured if I want to fine-tune Florence-2 for both OD and VQA simultaneously? Is this even possible, and if so, would the following structure be valid?
{
  "image": "IMG_20220316_165139_jpg.rf.e4c229a9128494d17992cbe88af575df.jpg",
  "prefix": ["<OD>", "<VQA> How many cards are in the image?"],
  "suffix": [
    "9 of diamonds<loc_141><loc_18><loc_404><loc_465>jack of diamonds<loc_589><loc_120><loc_789><loc_454>queen of diamonds<loc_308><loc_482><loc_570><loc_966>king of diamonds<loc_549><loc_477><loc_777><loc_904>10 of diamonds<loc_396><loc_75><loc_613><loc_458>",
    "5"
  ]
}
Thank you so much again! :) | 
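For anyone comparing the two layouts above: nothing in this thread confirms that a loader accepts list-valued prefix and suffix fields. A minimal sketch of one alternative, assuming the loader expects one string prefix and one string suffix per line (as in the single-task samples), is to expand each list-form record into one JSON line per task. The helper name split_multitask_record and the file name cards.jpg are hypothetical, and whether maestro or the notebook trains correctly on such a mixed file is an assumption to verify.

```python
import json


def split_multitask_record(record):
    """Expand a record with list-valued prefix/suffix into one record per task.

    Hypothetical helper: assumes the training loader wants a single string
    prefix and a single string suffix on every JSON line.
    """
    prefixes, suffixes = record["prefix"], record["suffix"]
    if isinstance(prefixes, str):
        # Already in the single-task layout, nothing to expand.
        return [record]
    if len(prefixes) != len(suffixes):
        raise ValueError("prefix and suffix lists must have the same length")
    return [
        {"image": record["image"], "prefix": p, "suffix": s}
        for p, s in zip(prefixes, suffixes)
    ]


record = {
    "image": "cards.jpg",  # hypothetical file name
    "prefix": ["<OD>", "<VQA> How many cards are in the image?"],
    "suffix": ["9 of diamonds<loc_141><loc_18><loc_404><loc_465>", "5"],
}

# One JSON object per line, one line per (image, task) pair.
lines = [json.dumps(r) for r in split_multitask_record(record)]
print("\n".join(lines))
```

With this layout the same image simply appears on several lines, once per task, so each line keeps the single-task shape shown earlier.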
-
| Did you guys find an answer to this? | 
-
Hi everyone,
I came across the notebook discussing how to fine-tune Florence-2 for Object Detection, and I have a question regarding the structure of the JSON Lines dataset when fine-tuning for multiple tasks.
Specifically, how should the dataset be formatted if I want to fine-tune for more than one task?
Should the prefix field be a list of task string IDs, while the suffix field contains a list of strings that represent the answers for each task? For example, would the following structure be correct?
{
  "prefix": ["<OD>", "<OCR>"],
  "suffix": [
    "ace of hearts<loc_345><loc_315><loc_582><loc_721>2 of hearts<loc_709><loc_115><loc_888><loc_509>3 of hearts<loc_529><loc_228><loc_735><loc_613>4 of hearts<loc_98><loc_421><loc_415><loc_845>",
    "answer_for_ocr"
  ]
}
Additionally, is there a guide available on how to format datasets for each task?
I appreciate any guidance on this!
Thank you!
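As a rough illustration of the OD suffix format in the example above, here is a minimal sketch that builds such a string from pixel-space boxes. It assumes each `<loc_N>` token is a coordinate scaled to 1000 bins (0 to 999) relative to the image width or height, in x1, y1, x2, y2 order. That scaling is an assumption and should be checked against the Florence-2 processor. The names to_loc and od_suffix are hypothetical.

```python
def to_loc(value, size, bins=1000):
    """Map a pixel coordinate to a bin index in [0, bins - 1] (assumed scaling)."""
    return min(max(int(value * bins / size), 0), bins - 1)


def od_suffix(boxes, width, height):
    """Build an OD suffix from (label, x1, y1, x2, y2) boxes given in pixels."""
    parts = []
    for label, x1, y1, x2, y2 in boxes:
        coords = (
            to_loc(x1, width),
            to_loc(y1, height),
            to_loc(x2, width),
            to_loc(y2, height),
        )
        parts.append(label + "".join(f"<loc_{c}>" for c in coords))
    return "".join(parts)


# Hypothetical 1000x500 image with a single box.
boxes = [("ace of hearts", 100, 50, 500, 250)]
print(od_suffix(boxes, width=1000, height=500))
# prints: ace of hearts<loc_100><loc_100><loc_500><loc_500>
```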