Questions Regarding Dataset Annotation Process #7

@parkervg

Description

Hi,

I have some questions about the process for annotating ground-truth answers in your TAG benchmark. There seem to be quite a few questions that are inherently subjective, with no single correct 'ground truth' answer. In addition, I see some inconsistencies between the questions in tag_queries.csv and those in hand_written.py.

It would be very useful if you could share the exact outputs produced by your hand_written.py script, so we can see which versions of the questions and annotated ground-truth answers were used to report the performance numbers in the paper. Any help here would be greatly appreciated!

Subjective Questions

  • pipeline_59(): Of the top 10 players taller than 180 ordered by average heading accuracy descending, what are the top 3 most unique sounding names?
    • What are the criteria for saying, e.g., that 'Per Mertesacker' is a more unique name than 'Miroslav Klose'?
  • pipeline_50(): Among the magnet schools with SAT test takers of over 500, which school name sounds most futuristic?
    • What defines 'most futuristic'? Choosing between, say, 'Millikan High' and 'Polytechnic High' feels subjective.
  • pipeline_51(): Of the 5 posts with highest popularity, list their titles in order of most technical to least technical.
  • pipeline_56(): Among the posts owned by a user over 65 with a score of over 10, what are the post id's of the top 2 posts made with the least expertise?
    • How is 'least expertise' defined to the annotator?
  • pipeline_60(): Out of users that have obtained at least 200 badges, what are the top 2 display names that seem most based off a real name?
    • Why is 'Glen_b' more based off of a real name than 'whuber'?
  • pipeline_107(): Of all the comments commented by the user with a username of Harvey Motulsky and with a score of 5, rank the post ids in order of most helpful to least helpful
    • Was 'most helpful' defined in a specific way to the annotators?
  • pipeline_61(): Of the cities containing exclusively virtual schools which are the top 3 safest places to live?
    • Is a measure of 'safest place to live' defined somewhere in the BIRD database or elsewhere?
  • pipeline_62(): List the cities containing the top 5 most enrolled schools in order from most diverse to least diverse.
    • Similar question here: Is 'most diverse school' a criteria defined in the BIRD database?
  • pipeline_64(): Of the schools with the top 3 SAT excellence rate, order their counties by academic reputation from strongest to weakest.
    • A couple of questions here: how is 'strongest academic reputation' defined? Additionally, while the question asks for an ordered list, the LOTUS program (and the corresponding ground truth answer) returns a single item, 'Santa Clara'.
  • pipeline_65(): Among the cities with the top 10 lowest enrollment for students in grades 1 through 12, which are the top 2 most popular cities to visit?
    • How is 'most popular cities to visit' defined? The ground truth chooses 'Shaver Lake' over 'Wawona', but a quick Google search suggests that Wawona/Yosemite gets far more visitors than Shaver Lake.

Dataset Inconsistencies

  • pipeline_40(): Among the players whose height is over 180, how many of them have a volley score of over 70 and are taller than Bill Clinton?
    • Judging from the variable name steph_height and the example in Appendix A of the paper, it seems as though this was switched from 'Steph Curry' to 'Bill Clinton' at some point. Which version of the dataset is reported in the paper?
  • pipeline_952(): Of the constructors that have been ranked 1 in 2014, whose logo looks most like Secretariat?
    • In tag_queries.csv, this is 'Of the constructors that have been ranked 1 in 2014, which has the most prestige?'. Similar question: which version of the question was used when reporting performance in your paper?
  • pipeline_5(): What are the two most common first names among the female school administrators?
    • On line 94, .head(20) is applied, I imagine to speed up query execution. However, this makes the query no longer faithful to the original natural language question: nothing in the database guarantees that a female name is among the top 20 most common names. A faithful query would need to call sem_filter() over all names in the schools_df table (see the sketch after this list).
  • pipeline_4(): What is the grade span offered in the school with the highest longitude in counties that are part of the 'Silicon Valley' region?
    • In tag_queries.csv, 'cities' is used in place of 'counties'.
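
To make the pipeline_5() point concrete, here is a rough sketch of the two orderings. This is my own illustration, not code from the repo: the column name AdmFName1, the file path schools.csv, the model name, and the sem_filter prompt are all assumptions; the only point is where .head(20) sits relative to sem_filter().

```python
# Sketch only: AdmFName1, schools.csv, the model name, and the prompt are assumed,
# not taken from hand_written.py.
import pandas as pd
import lotus
from lotus.models import LM

lotus.settings.configure(lm=LM(model="gpt-4o"))  # model setup approximated

schools_df = pd.read_csv("schools.csv")  # hypothetical export of the BIRD schools table

# Current ordering: truncate to the 20 most common first names *before* the semantic filter.
counts = schools_df["AdmFName1"].value_counts()
top20 = counts.head(20).rename_axis("AdmFName1").reset_index(name="count")
unfaithful = top20.sem_filter("{AdmFName1} is a female first name").head(2)

# Faithful ordering: semantically filter *all* unique administrator first names first,
# then count occurrences and take the top 2.
unique_names = schools_df[["AdmFName1"]].drop_duplicates()
female_names = unique_names.sem_filter("{AdmFName1} is a female first name")
faithful = (
    schools_df[schools_df["AdmFName1"].isin(female_names["AdmFName1"])]["AdmFName1"]
    .value_counts()
    .head(2)
)
print(faithful)
```

Filtering every unique name is obviously slower (more LLM calls), which I assume is why .head(20) was added, but it changes which question the program is actually answering.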
