NOTE: THIS DATA & DOCUMENTATION ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. THE DATA IS DERIVED FROM WIKIPEDIA AND OTHER SOURCES. YOU NEED TO CAREFULLY CHECK THE LICENSE OF EACH SOURCE BEFORE ANY USE.


This folder contains the three gold standard data sets used in our paper:
Evaluating Web Table Annotation Methods: From Entity Lookups to Entity Embeddings

Our aim is to make our experiments reproducible and assist the community with building state-of-the-art web table annotation systems. If you encounter any issues using the data sets, please do not hesitate to contact the authors.

The T2D gold standard data was downloaded from: http://webdatacommons.org/webtables/goldstandard.html (download date: August 25, 2016). Please refer to the original website for the latest details and citation information.

The Limaye gold standard was downloaded from: http://fe.cs.northwestern.edu/TabEL/ (download date: August 25, 2016). Please refer to the original website and the following paper for more details and citation information:
G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and Searching Web Tables Using Entities, Types and Relationships. PVLDB, 3(1):1338–1347, 2010.


**Data format**

CSV:
The .csv files are formatted as double-quoted (' " ') fields, separated by commas (',').
In the tables files, each file corresponds to one table, each field corresponds to a column, and each line corresponds to a row.
In the entities files, there are only three fields:
"DBpedia uri","cell string","row number"
representing the correct annotation, the string of the label column cell, and the row (starting from 0) in which this mapping is found, respectively.

Tables and entities files that correspond to the same table have the same filename.
The same formatting and naming convention is used in T2D gold standard (http://webdatacommons.org/webtables/goldstandard.html).
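
As an illustration of the entities CSV layout described above, the following is a minimal Python sketch that parses the three fields into tuples. The function name read_entity_mappings and the sample line are hypothetical and not part of the data sets; the sketch assumes that every non-empty line has exactly three fields and that the row number parses as an integer.

```python
import csv
import io


def read_entity_mappings(text):
    """Parse the text of an entities .csv file into (uri, cell_string, row) tuples.

    Assumes three double-quoted, comma-separated fields per line:
    "DBpedia uri","cell string","row number" (rows start from 0).
    """
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    mappings = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        uri, cell_string, row_number = record
        mappings.append((uri, cell_string, int(row_number)))
    return mappings


# Made-up sample content, for illustration only.
sample = (
    '"http://dbpedia.org/resource/Berlin","Berlin","0"\n'
    '"http://dbpedia.org/resource/Paris","Paris, France","1"\n'
)
for uri, cell_string, row in read_entity_mappings(sample):
    print(row, cell_string, uri)
```

To read a real file, the same function can be applied to the text returned by open(path, encoding="utf-8").read(); the encoding is an assumption and may need adjusting.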

JSON:
Each line in a .json file corresponds to one table, written as a JSONObject. T2D and Limaye tables files contain only one line (table) per file, while the Wikipedia gold standard contains multiple lines (tables) per .json file. In T2D and Limaye, the entity mappings of a table can be found in the entities file with the same filename, while in Wikipedia, the entity mappings of each table can be found in the line of the entities file whose "tableId" field matches that of the corresponding table.

The contents of a table in .json are given as a two-dimensional array (a JSONArray of JSONArrays), called "contents". Each JSONArray in the contents represents a table row. Each element of this array is a JSONObject, representing one cell of the row. The field "data" of each cell contains the cell string contents, while there may also be a field "isHeader" to denote whether the current cell is in a header row. In the Wikipedia gold standard there may also be a "wikiPageId" field, denoting the existing hyperlink of this cell to a Wikipedia page. This field contains only the suffix of the Wikipedia URL, omitting the prefix "https://en.wikipedia.org/wiki/".
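
A minimal Python sketch of reading one table line in this layout is given below. The function names parse_table_line and wiki_url and the sample table are hypothetical. The sketch assumes that "isHeader", when present, is a JSON boolean, that "wikiPageId" is a string, and that a row counts as a header row if any of its cells is flagged.

```python
import json

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def parse_table_line(line):
    """Split one table (one JSON line) into header rows and data rows of cell strings."""
    table = json.loads(line)
    header_rows, data_rows = [], []
    for row in table["contents"]:
        cells = [cell.get("data", "") for cell in row]
        # Assumption: a row is a header row if any of its cells has isHeader set to true.
        if any(cell.get("isHeader") is True for cell in row):
            header_rows.append(cells)
        else:
            data_rows.append(cells)
    return header_rows, data_rows


def wiki_url(cell):
    """Rebuild the full Wikipedia URL of a cell, or return None if it has no link."""
    page = cell.get("wikiPageId")
    return WIKI_PREFIX + page if page else None


# Made-up sample table, for illustration only.
sample_line = json.dumps({
    "contents": [
        [{"data": "City", "isHeader": True}, {"data": "Country", "isHeader": True}],
        [{"data": "Berlin", "wikiPageId": "Berlin"}, {"data": "Germany"}],
    ]
})
headers, rows = parse_table_line(sample_line)
print(headers)
print(rows)
```

For the Wikipedia gold standard, which stores several tables per file, the same function can be called once per non-empty line of the file.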

The entity mappings files use the same three fields as the CSV entities files:
["DBpedia uri","cell string",row number] inside the "mappings" field of a .json file.
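
The following Python sketch shows one way to index the mappings by table. It is written under stated assumptions rather than against a verified schema: it assumes one JSONObject per line, that "mappings" holds a list of [uri, cell string, row number] triples, and that the table identifier (where present) is stored under "tableId". The function name index_mappings_by_table and the sample lines are hypothetical.

```python
import json


def index_mappings_by_table(lines):
    """Map each tableId to its list of (uri, cell_string, row) tuples.

    Lines without a "tableId" field (assumed possible in files that hold a
    single table) are stored under the key None.
    """
    index = {}
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        triples = [
            (uri, cell_string, int(row))
            for uri, cell_string, row in record["mappings"]
        ]
        index[record.get("tableId")] = triples
    return index


# Made-up sample lines, for illustration only.
sample_lines = [
    json.dumps({"tableId": "t1", "mappings": [
        ["http://dbpedia.org/resource/Berlin", "Berlin", 1],
    ]}),
    json.dumps({"tableId": "t2", "mappings": [
        ["http://dbpedia.org/resource/Paris", "Paris", 0],
        ["http://dbpedia.org/resource/Rome", "Rome", 2],
    ]}),
]
index = index_mappings_by_table(sample_lines)
print(index["t2"])
```

With such an index, the mappings of a Wikipedia table can be looked up using the identifier of the corresponding table in the tables file.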