This article describes my understanding of how one should annotate their own dataset when working with https://github.com/microsoft/table-transformer.
Here are some example annotated images. The images sent to training are exactly these - just the full page of the document. The labels are one or more tightly fitted bounding boxes around the tables within the page.
Tables are wrapped very tightly around their content, with one pixel or less of padding around the words. The table structure itself is not considered at all in the wrapping: even if the table is clearly a grid, as in Excel, the annotation is still wrapped around the content within that grid. Additionally, the wrapping covers only the table itself - it does not include table titles, captions, or other surrounding content.
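A tight table annotation like this can be expressed in the Pascal VOC-style XML that the table-transformer datasets use. The sketch below is my own illustration, not code from the repo; the helper name and the example file name and coordinates are made up for demonstration.

```python
import xml.etree.ElementTree as ET

def make_voc_annotation(filename, width, height, table_bboxes):
    """Build a minimal Pascal VOC-style XML annotation with one
    tightly fitted 'table' object per (xmin, ymin, xmax, ymax) bbox.
    Coordinates are in pixels of the full page image."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for xmin, ymin, xmax, ymax in table_bboxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = "table"
        bb = ET.SubElement(obj, "bndbox")
        ET.SubElement(bb, "xmin").text = str(xmin)
        ET.SubElement(bb, "ymin").text = str(ymin)
        ET.SubElement(bb, "xmax").text = str(xmax)
        ET.SubElement(bb, "ymax").text = str(ymax)
    return ET.tostring(root, encoding="unicode")

# One full-page image with a single tight table bbox (hypothetical values):
xml = make_voc_annotation("page_0001.jpg", 1700, 2200, [(120, 340, 1580, 1210)])
```

The key point mirrored here is that the bbox hugs the words inside the table, and nothing outside the table (titles, captions) is part of the object.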
These were made with a slightly modified view_annotations.py script. They are tightly cropped, so they do not represent what the images look like before being sent to training, but they do reflect what the annotations look like for a given image.
These are what the images look like before they are sent to TSR training. We can see that there is substantial "padding" around the table of interest in each image. In some documents the table is isolated and the area around it is empty, while in other, more crowded examples there is text around the table.
Preparing your own TSR dataset for training requires a tight-fitting table label around the table, which is then cropped out of the original document (the full PDF page) with a padding of 30 pixels.
Cropped table that happens to have nothing around it
Cropped table with other tables, captions, and other parts of the document visible
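The 30-pixel crop described above amounts to simple bbox arithmetic: expand the tight table label on every side, clamp to the page, and re-express the table coordinates relative to the crop. This is a minimal sketch under those assumptions; the function name is mine, not from the repo.

```python
def crop_box_with_padding(table_bbox, page_w, page_h, pad=30):
    """Expand a tight table bbox by `pad` pixels on every side,
    clamped to the page bounds, and return both the crop box and
    the table's coordinates relative to that crop."""
    xmin, ymin, xmax, ymax = table_bbox
    cx0 = max(0, xmin - pad)
    cy0 = max(0, ymin - pad)
    cx1 = min(page_w, xmax + pad)
    cy1 = min(page_h, ymax + pad)
    crop = (cx0, cy0, cx1, cy1)
    # Shift the table bbox into the crop's coordinate system.
    table_in_crop = (xmin - cx0, ymin - cy0, xmax - cx0, ymax - cy0)
    return crop, table_in_crop

# Hypothetical tight label on a 1700x2200 page:
crop, rel = crop_box_with_padding((120, 340, 1580, 1210), 1700, 2200)
# crop -> (90, 310, 1610, 1240); rel -> (30, 30, 1490, 900)
```

The resulting `crop` tuple can be passed directly to something like Pillow's `Image.crop`; the clamping explains why a table near the page edge ends up with less than 30 pixels of visible padding on that side.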
From these images we can infer the following things: