The latest studies on AI agree: in the future they will be much smarter than us

How smart is artificial intelligence, really? The scientific community is grappling with how to formulate precise measures for testing the performance of these algorithms

It seems like an idle question, because leading experts in the field have already given fairly clear answers. Turing Award winner Geoffrey Hinton, for example, recently said: “I’ve changed my mind that these things are going to be smarter than us. I think we are very close now and that in the future they will be much smarter than us”. Another Turing Award winner, Yoshua Bengio, similarly stated: “Recent advances suggest that the future where we know how to build superintelligent AIs (smarter than humans across the board) is also closer than most people expected just a year ago.”

Faced with such statements, it is urgent to bring the question back into the context of studies and data, as Melanie Mitchell, a professor at the Santa Fe Institute in the USA and a specialist in artificial intelligence and cognitive science, did in Science. The point raised by the professor is the following: to know whether, and in what sense, the performance of AI algorithms is at the level of human intelligence, precise measurements are needed; for this purpose, tests designed for people are usually administered to the various versions of artificial intelligence. In doing so, however, one must be extremely careful about effects that are irrelevant when those tests are used on human intelligence, but carry fundamental weight when they are used to evaluate whether an algorithm is capable of cognitive performance similar or superior to that of humans.

The first of these factors is contamination of the data used for AI training, which may contain the solutions to the very tests used for evaluation. While it is very unlikely that a person has already seen the solution to a question presented to assess their mental abilities, AI models such as GPT-4 have been trained on huge amounts of digital data. OpenAI, the maker of GPT-4, said its system matched human performance on various professional and academic tests. Faced with the objection that the algorithm could simply have encountered most of the problems and their respective answers in its training set, OpenAI did not provide access to that set, but declared that it had checked to exclude this possibility by searching for substrings of the test queries within the data in question. This method is clearly superficial, because it misses cases in which the same questions appear with small textual variations (which GPT-4 handles with ease); consequently, as Professor Mitchell recalled, the scientific community raised heavy criticism, also because it turned out that the answers provided by GPT-4 were much more accurate for tests developed before 2021, i.e. before the cutoff date of the data collected for its training.
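To see why a bare substring search is a weak contamination check, consider a minimal sketch (my own illustration, not OpenAI’s actual procedure): an exact-match lookup fails to flag a benchmark question that differs from the training text by only a few words, while even a crude similarity score would have caught the overlap.

```python
# Illustrative only: why an exact substring check can miss near-duplicate
# benchmark items hidden in a training corpus. All strings are made up.
from difflib import SequenceMatcher

training_text = (
    "Q: A train leaves the station at 9:00 travelling at 60 km/h. "
    "How far has it gone by 11:30? A: 150 km."
)
benchmark_question = (
    "A train departs the station at 9:00 travelling at 60 km/h. "
    "How far has it travelled by 11:30?"
)

# Naive check, similar in spirit to searching for substrings of the test
# queries inside the training set: exact match only.
print("exact substring match:", benchmark_question in training_text)  # False

# Fuzzy check: a character-level similarity ratio flags the near-duplicate.
ratio = SequenceMatcher(None, benchmark_question, training_text).ratio()
print(f"similarity ratio: {ratio:.2f}")
print("possible contamination" if ratio > 0.6 else "no overlap detected")
```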

Secondly, it has been widely shown that the currently popular AI systems lack robustness when solving problems of varying complexity. Presenting the same question in different forms sometimes yields answers that hint at some form of intelligence, and sometimes completely erroneous solutions; in other words, the presumed intelligence often reduces to recognizing the form of the problem, which the language model traces back to a similar form it has already encountered (in the cases where a sensible answer is obtained).
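One simple way to probe this is to feed a system several paraphrases of the same question and check whether the answers agree. The sketch below assumes a placeholder `ask_model` function standing in for whatever system is under test; it is not tied to any particular model or API.

```python
# Minimal paraphrase-robustness probe. `ask_model` is a hypothetical stand-in
# for the system under test; here it returns a canned answer so the sketch
# runs on its own.
from collections import Counter

def ask_model(prompt: str) -> str:
    return "150 km" if "60 km/h" in prompt else "I don't know"

paraphrases = [
    "A train travels at 60 km/h for 2.5 hours. How far does it go?",
    "How far does a train moving at 60 km/h get in two and a half hours?",
    "After 2.5 hours at a constant 60 km/h, what distance has the train covered?",
]

answers = Counter(ask_model(p) for p in paraphrases)
print(answers)
# A robust solver should give the same (correct) answer for every phrasing;
# wide disagreement across paraphrases suggests matching on surface form
# rather than reasoning about the underlying problem.
```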

Finally, there is an even more fundamental problem: the use of evaluation systems that are deeply flawed in their construction, in ways that AI can exploit to find heuristics completely different from correct reasoning and strictly dependent on defects in the test data. It has been shown that the reference datasets used to train AI systems can contain subtle statistical associations that machines can use to produce correct answers without actually “understanding” anything at all. For example, one study found that an AI system that successfully classified malignant tumors in skin images was using the presence of a ruler in the images as an important cue (images of non-malignant tumors tended not to include rulers). Another study showed that an AI system that achieved human-level performance on a benchmark for assessing reasoning skills relied on the fact that correct answers were (unintentionally) more likely to contain certain keywords: for example, answer choices containing the word “not” were more likely to be correct.
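To make the keyword shortcut concrete, here is a toy illustration with fabricated items (not the benchmark from the study): if the correct option contains the word “not” more often than chance, a trivial rule that keys on that word scores well without doing any reasoning.

```python
# Toy, fabricated multiple-choice items whose correct options skew towards
# containing the word "not", mimicking the kind of unintended bias described
# above. Format: (answer choices, index of the correct choice).
toy_benchmark = [
    (["It is not possible to conclude this.", "The claim clearly follows."], 0),
    (["The argument is valid.", "The premise does not support it."], 1),
    (["They could not have met.", "They met in 1990."], 0),
    (["The sample is representative.", "The result does not generalize."], 1),
]

def keyword_shortcut(choices):
    """'Answer' by picking the first option containing 'not', else option 0."""
    for i, choice in enumerate(choices):
        if "not" in choice.lower():
            return i
    return 0

correct = sum(keyword_shortcut(choices) == gold for choices, gold in toy_benchmark)
print(f"shortcut accuracy: {correct}/{len(toy_benchmark)}")  # 4/4 on this biased set
```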

These examples show how AI systems can pick up on biases in the data used first for their training and then for the evaluation of their abilities: biases that are insignificant when evaluating a human being, because they are imperceptible to our brain, but that are very useful for deriving heuristics which apparently produce the desired behavior yet in reality are statistical traps that undermine the machine’s actual capabilities. The problems listed, along with others currently under study, all illustrate a key point: tests designed to evaluate the cognitive abilities and performance of human beings can be extremely misleading when used to evaluate the performance and capabilities of an AI (and perhaps, by extension, of brains other than the human one). Before making claims about how far we have come in reproducing human intelligence, therefore, and without even addressing the problem of what that intelligence is, it is urgent to develop measurement methods and investigative tools that do not suffer from the pitfalls briefly outlined here.
