Tutorial here. The input file is in .tsv format. Each line holds five tab-separated fields: uniq-id, image-id, caption, predicted object labels (taken from VinVL, not used), and the image as a base64 string. Example row: 162365 12455 the sun sets ...
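As a minimal sketch of reading this format, the Python snippet below splits each line on tabs and decodes the base64 column into raw image bytes. The names `Row`, `parse_row`, and `read_rows` are hypothetical helpers introduced for illustration and are not part of the original tooling. The sketch assumes exactly five fields per line, UTF-8 text, and no header row.

```python
import base64
from typing import Iterator, NamedTuple


class Row(NamedTuple):
    # Hypothetical container for one line of the .tsv file.
    uniq_id: str
    image_id: str
    caption: str
    object_labels: str  # predicted labels from VinVL; not used downstream
    image_bytes: bytes  # raw image data decoded from the base64 column


def parse_row(line: str) -> Row:
    """Split one tab-separated line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    uniq_id, image_id, caption, labels, image_b64 = fields
    return Row(uniq_id, image_id, caption, labels, base64.b64decode(image_b64))


def read_rows(path: str) -> Iterator[Row]:
    """Yield parsed rows from a .tsv file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield parse_row(line)
```

The decoded `image_bytes` can then be handed to an image library, for example Pillow's `Image.open(io.BytesIO(row.image_bytes))`, if Pillow is installed.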