Even sophisticated language models such as OpenAI’s GPT-3 struggle with socially important topics like morality, history, and law. That’s the top-line finding from a new paper coauthored by Columbia, University of Chicago, and University of California, Berkeley researchers that proposes a 57-task test to measure models’ ability to reason. Models must possess problem-solving abilities and extensive knowledge about the world to perform well on the test. But in experiments, the coauthors found that the models they benchmarked — including GPT-3 — frequently didn’t know when they were wrong.
The goal of the novel test set is to bridge the gap between the knowledge that models see during training and existing measures of success in natural language processing. Like all machine learning models, language models learn patterns from vast data sets often sourced from Wikipedia, Reddit, ebooks, and other web sources. Some recently introduced benchmarks attempt to capture the linguistic skills of models, but so far, there’s little evidence to suggest a correlation between benchmark performance and a model’s grasp of commonsense reasoning.
The researchers claim their test is different in that it assesses models across subjects humans commonly learn, like mathematics, history, and ethics. To craft it, graduate and undergraduate students collected 15,908 questions from freely available sources online, including practice exams for undergraduate courses, questions written for readers of Oxford University Press books, and tests like the Graduate Record Examination, U.S. Medical Licensing Examination, and Examination for Professional Practice in Psychology. The tasks range in difficulty from an elementary level to an “advanced professional level,” a sampling the coauthors argue is sufficient for identifying a model’s blind spots.

Above: Example questions from the researchers’ test set.
“We measure arbitrary real-world text understanding,” they wrote, noting that each subject contains at least 100 test examples. “Since models are pretrained on the internet, this enables us to test how well they can extract useful knowledge from massive corpora.”
In addition to GPT-3, the researchers benchmarked Google’s T5 and the Allen Institute for AI’s UnifiedQA question-answering model against their test set. The results show that meaningful progress has only become possible within recent months, with models containing up to 13 billion parameters achieving 25% accuracy, roughly what random guessing yields on four-option multiple-choice questions, and 175-billion-parameter models like GPT-3 reaching 43.9% accuracy. (Parameters are parts of the model learned from historical training data.) Even so, GPT-3 failed to excel at any single subject; its performance on the test set was lopsided, with almost 70% accuracy for its best subject (U.S. foreign policy) but “near-random” performance for several other subjects (e.g., college chemistry).
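To make the "lopsided" result concrete, here is a minimal sketch of how per-subject accuracy on a multiple-choice test might be tallied and compared with chance. It assumes four answer options per question (so random guessing scores about 25%). The function name, the record layout, and the toy records are all invented for illustration and are not taken from the paper or its data.

```python
from collections import defaultdict

# Assumption: four answer options per question, so guessing scores ~25%.
CHANCE = 0.25


def per_subject_accuracy(records):
    """Return {subject: accuracy} for (subject, predicted, correct) tuples."""
    right = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, correct in records:
        total[subject] += 1
        right[subject] += int(predicted == correct)
    return {subject: right[subject] / total[subject] for subject in total}


# Toy records invented for this example; they are NOT results from the paper.
toy_records = [
    ("us_foreign_policy", "A", "A"),
    ("us_foreign_policy", "C", "C"),
    ("us_foreign_policy", "B", "D"),
    ("us_foreign_policy", "D", "D"),
    ("college_chemistry", "A", "B"),
    ("college_chemistry", "C", "C"),
    ("college_chemistry", "D", "A"),
    ("college_chemistry", "B", "C"),
]

scores = per_subject_accuracy(toy_records)
for subject, accuracy in sorted(scores.items()):
    print(f"{subject}: {accuracy:.2f} (chance {CHANCE:.2f})")
```

Breaking the score out by subject, rather than reporting only one overall average, is what exposes a model that does well in one area while sitting at chance in another.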
“Overall, GPT-3 does poorly on highly procedural problems,” the researchers explain. “It is notably poor at modeling human (dis)approval, as evident by the low performance on the professional law and moral scenarios tasks, [and it] also has difficulty performing calculations, so much so that it exhibits poor performance on elementary mathematics and many other STEM subjects with ‘plug and chug’ problems … We speculate that is in part because GPT-3 acquires declarative knowledge more readily than procedural knowledge.”
The findings imply that current models have room for improvement, but it’s unclear whether existing techniques will suffice. As the researchers point out, previous research indicates that a 10 times increase in model size must be accompanied by an approximately 5 times increase in data, which might be logistically prohibitive.
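The arithmetic behind that rule of thumb can be sketched in a few lines. The snippet assumes the relationship is a smooth power law, so that every tenfold increase in model size calls for roughly five times as much data. That interpolation is an assumption made here for illustration, and the function name is hypothetical.

```python
import math

# Assumption: data needs grow as a power law in model size, calibrated so
# that a 10x larger model needs about 5x more data (exponent ~0.699).
DATA_EXPONENT = math.log10(5)


def data_multiplier(model_multiplier):
    """Approximate factor by which training data must grow."""
    return model_multiplier ** DATA_EXPONENT


for factor in (10, 100):
    print(f"{factor}x larger model -> ~{data_multiplier(factor):.1f}x more data")
```

Under this assumption, going from a 175-billion-parameter model to one ten times larger would call for about five times as much training text, and a hundredfold jump for about 25 times as much.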
“Aside from the tremendous expense in creating multi-trillion parameter language models, data may also become a bottleneck,” the researchers continued. “There is far less written about esoteric branches of knowledge than about everyday text.”