Hi,
I have some questions about the process for annotating ground-truth answers in your TAG benchmark. There seem to be quite a few questions that are inherently subjective, with no single correct 'ground truth' answer. In addition, I see some inconsistencies between the questions in `tag_queries.csv` and those in `hand_written.py`.
It would be very useful if you could share the exact outputs produced by your `hand_written.py` script, so we can see which versions of the questions and annotated ground-truth answers were used to report the performance numbers in the paper. Any help here would be greatly appreciated!
Subjective Questions
`pipeline_59()`: *Of the top 10 players taller than 180 ordered by average heading accuracy descending, what are the top 3 most unique sounding names?*
- What is the criterion for saying that, e.g., 'Per Mertesacker' is a more unique-sounding name than 'Miroslav Klose'?
`pipeline_50()`: *Among the magnet schools with SAT test takers of over 500, which school name sounds most futuristic?*
- What defines 'most futuristic'? Choosing between, say, 'Millikan High' and 'Polytechnic High' feels subjective.
`pipeline_51()`: *Of the 5 posts with highest popularity, list their titles in order of most technical to least technical.*

`pipeline_56()`: *Among the posts owned by a user over 65 with a score of over 10, what are the post id's of the top 2 posts made with the least expertise?*
- How is 'least expertise' defined to the annotator?
`pipeline_60()`: *Out of users that have obtained at least 200 badges, what are the top 2 display names that seem most based off a real name?*
- Why is 'Glen_b' more based off of a real name than 'whuber'?
`pipeline_107()`: *Of all the comments commented by the user with a username of Harvey Motulsky and with a score of 5, rank the post ids in order of most helpful to least helpful*
- Was 'most helpful' defined in a specific way for the annotators?
`pipeline_61()`: *Of the cities containing exclusively virtual schools which are the top 3 safest places to live?*
- Is a measure of 'safest place to live' defined somewhere in the BIRD database or elsewhere?
`pipeline_62()`: *List the cities containing the top 5 most enrolled schools in order from most diverse to least diverse.*
- Similar question here: is 'most diverse school' a criterion defined in the BIRD database?
`pipeline_64()`: *Of the schools with the top 3 SAT excellence rate, order their counties by academic reputation from strongest to weakest.*
- A couple of questions here: how is 'strongest academic reputation' defined? Additionally, while the question asks for an ordered list, the LOTUS program (and corresponding ground-truth answer) returns a single item, 'Santa Clara'.
`pipeline_65()`: *Among the cities with the top 10 lowest enrollment for students in grades 1 through 12, which are the top 2 most popular cities to visit?*
- How is 'most popular cities to visit' defined? The ground truth chooses 'Shaver Lake' over 'Wawona', but a quick Google search indicates that Wawona/Yosemite gets far more visitors than Shaver Lake.
Dataset Inconsistencies
`pipeline_40()`: *Among the players whose height is over 180, how many of them have a volley score of over 70 and are taller than Bill Clinton?*
- Judging from the variable `steph_height` and the example in Appendix A of the paper, it seems this question was switched from 'Steph Curry' to 'Bill Clinton' at some point. Which version of the dataset is reported in the paper?
`pipeline_952()`: *Of the constructors that have been ranked 1 in 2014, whose logo looks most like Secretariat?*
- In `tag_queries.csv`, this question is *Of the constructors that have been ranked 1 in 2014, which has the most prestige?* Similar question here: which version is used to report performance in the paper?
`pipeline_5()`: *What are the two most common first names among the female school administrators?*
- On line 94, `.head(20)` is applied, presumably to speed up query execution. However, this makes the query unfaithful to the original natural language question: there is no structural guarantee in the database that a female name appears among the top 20 most common names. A faithful query would need to call `sem_filter()` over all names in the `schools_df` table (see the sketch after this list).
`pipeline_4()`: *What is the grade span offered in the school with the highest longitude in counties that are part of the 'Silicon Valley' region?*
- In `tag_queries.csv`, 'cities' is used in place of 'counties'.
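For concreteness, here is a minimal sketch of what I mean by a faithful version of `pipeline_5()`. It assumes the LOTUS `sem_filter` operator already used in `hand_written.py`; the model configuration, CSV path, and `AdmFName1` column name are illustrative assumptions on my part, not taken from the actual script:

```python
import pandas as pd
import lotus
from lotus.models import LM

# Placeholder model configuration; hand_written.py may configure LOTUS differently.
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

# Placeholder path and column name for the BIRD schools table.
schools_df = pd.read_csv("schools.csv")

# Semantically filter over ALL administrator first names, not just the
# 20 most common ones, so no female name can be excluded up front.
female_df = schools_df.sem_filter("{AdmFName1} is a female first name")

# Only after the semantic filter do we count occurrences and take the top 2.
print(female_df["AdmFName1"].value_counts().head(2))
```

This is obviously slower than filtering a 20-row head, but as far as I can tell it is the only way the program's semantics match the natural language question.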