Tuesday, November 25, 2014

Is Statistics Computer Science?


Last week's post dealt with the concern many statisticians have that Statistics will be dominated or swallowed by Mathematics. That concern is rooted in the past, in the origins of Statistics, which rested largely on the pillars of mathematical formulation. As Science in general (and Statistics in particular) advances, driven by an unprecedented growth in computational power, another danger is taking shape with increasing clarity. The natural next step is for Statistics to start worrying about a future scenario in which it is dominated by Computer Science, our other, stronger cousin.

This domination worries me more. Mathematics and Statistics are both moving in the direction of Computer Science. This is happening for a number of reasons, which perhaps deserve a post of their own. But it is happening, and we increasingly see computer science researchers tackling subjects that fall within the remit of Statistics. The most conspicuous examples are Big Data and Machine Learning. There are several important points for discussion here. I prefer not to reinvent the wheel (as some computer science techniques do, reusing statistical techniques under another name). It is far better to reproduce here a text that lays out this discussion in a much more organized way.

The text below was written by Prof. Norman Matloff, who is in the Computer Science department of the University of California, Davis, and was formerly a member of the Statistics department at the same institution. It was published in the bulletins of the two most important statistical societies in the world: the British and the American.

I would only add that the text presents an optimistic scenario for Statistics, based largely on the problems with the way computer scientists use statistical methods. My more pessimistic position stems from the observation that today's society values fast answers, even imprecise ones, and seems less and less willing to wait for greater certainty about the correctness of the procedures.

On to the text...

The American Statistical Association (ASA) leadership, and many in statistics academia, have been undergoing a period of angst in the last few years. They worry that the field of statistics is headed for a future of reduced national influence and importance, with the feeling that the field is to a large extent being eclipsed by other disciplines, notably computer science.
This is exemplified by the rise of a number of new terms, largely in computer science, such as data science, big data, and analytics, with the popularity of the term machine learning growing rapidly. To many of us, this is 'old wine in new bottles', just statistics with new terminology.
I write this as both a computer scientist and statistician. I began my career in statistics, and though my departmental affiliation later changed to computer science, most of my research in computer science has been statistical in nature.
And I submit that the problem goes beyond the ASA's understandable concerns about the well-being of the statistics profession. The broader issue is not that computer science people are doing statistics, but rather that they are doing it poorly.
This is not a problem of quality of the computer science researchers themselves - many of them are highly talented. Instead, there are a number of systemic reasons for this:
  • The computer science research model is based on very rapid publication, with the venue of choice being conferences rather than slow journals. The work is refereed, but just on a one-time basis, not with the back-and-forth interaction of journals. As a result, the work is less thoroughly conducted and reviewed.
  • Because computer science departments tend to be housed in colleges of engineering, there is heavy pressure to bring in lots of research funding, and produce lots of PhD students. There is also rapid change in fashionable research topics. Thus there is little time for deep, long-term contemplation about the problems at hand.
  • There is rampant 'reinventing the wheel-ism'. Due in part to the pressure for rapid publication and the lack of long-term commitment to research topics, most computer science researchers in statistical issues have little knowledge of the statistics literature.
For instance, consider a certain paper, by a very prominent computer science author, on the use of mixed labelled and unlabelled training data in classification. Sadly the paper cites nothing in the extensive statistics literature on this topic, consisting of a long stream of papers from 1977 to the present. The computer science 'engineering-style' research model causes a cavalier attitude towards underlying models and assumptions.
Consider, for example, a talk I attended by a machine learning specialist who had just earned her PhD at one of the very top computer science departments in the world. She had taken a Bayesian approach, and I asked why she had chosen that specific prior distribution. She couldn’t answer - she had just blindly used what her thesis adviser had given her - and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
Computer science people tend to have grand, starry-eyed ambitions. On the one hand, this is a huge plus, leading to highly impressive feats such as recognising faces in a large crowd. But this mentality leads to an oversimplified view, with everything being viewed as a paradigm shift.
Neural networks epitomise this problem. Enticing phrasing such as 'neural networks work like the human brain’ blinds many computer science researchers to the fact that neural nets are not fundamentally different from other parametric and non-parametric methods for regression and classification. Among computer science folks, there is often a failure to understand that the celebrated accomplishments of 'machine learning' have come mainly from applying huge resources to a problem, rather than because fundamentally new technology has been invented.
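The claim above, that neural nets are not fundamentally different from other regression and classification methods, can be made concrete: a network with no hidden layer and a sigmoid output unit is exactly logistic regression, and training it by gradient descent on the log-loss recovers the same estimator a statistician would obtain by maximum likelihood. A minimal sketch, using synthetic data chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (illustrative only).
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A "neural network" with no hidden layer: output = sigmoid(X @ w + b).
# Gradient descent on the log-loss is maximum-likelihood logistic regression.
w = np.zeros(2)
b = 0.0
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)          # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of mean log-loss in w
    grad_b = np.mean(p - y)          # gradient in the intercept
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
```

The fitted weights recover the signs of `true_w`, exactly as a logistic regression fit would; nothing in the "network" view adds anything beyond the classical model.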
None of this is to say that people in computer science should stay out of statistics research. But the sad truth is that the process of computer science overshadowing statistics researchers in their own field is causing precious resources (research funding, faculty slots, the best potential grad students, attention from government policy makers) to go disproportionately to computer science, even though the statistics community is arguably better equipped to make use of them. Statistics is important to the nation and to the world, and if scarce resources aren't being used well, it's everyone's loss.
What can be done? I offer the following as a start:
  • There should be more joint faculty appointments between computer science and statistics departments. Teaching a course in the 'other' department forces one to think more carefully about the issues in that field, and fosters interaction between fields.
  • Computer science undergraduates should be encouraged to pursue a double major with statistics, and to go on for graduate work in statistics. There are excellent precedents for the latter, such as Hadley Wickham and Michael Kane, both of them winners of the ASA's John Chambers Statistical Software Award.
  • Statistics researchers should be much more aggressive in working on complex, large-scale, 'messy' problems, such as the face recognition example cited earlier.
  • Statistics undergraduate and graduate curricula should be modernised (while retaining mathematical rigor). Even mathematical stat courses should involve computation. Emphasis on significance testing, well known to be underinformative at best and misleading at worst, should be reduced. Modern tools, such as cross-validation and non-parametric density/regression estimation, should be brought in early in the curriculum.
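The cross-validation mentioned in the last recommendation is a good example of a modern tool that fits in a few lines of code and could be taught early. A minimal k-fold sketch (the function name `kfold_mse` and the least-squares example data are illustrative, not from any particular curriculum):

```python
import numpy as np

def kfold_mse(X, y, fit, predict, k=5, seed=0):
    """Estimate out-of-sample mean squared error by k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])                       # fit on k-1 folds
        errors.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return float(np.mean(errors))

# Example: ordinary least squares with an intercept column.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)

def ols_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ols_predict(beta, X):
    return X @ beta

mse = kfold_mse(X, y, ols_fit, ols_predict)
```

The cross-validated MSE is close to the noise variance (0.01 here), which is what a student should expect from an honest out-of-sample estimate.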
The academic world is slow to change, but the stakes are high. There is an urgent need for the fields of computer science and statistics to re-examine their roles in research, both individually and in relation to each other.

2 comments:

  1. Related to this subject, I strongly recommend the suggestions of my latest guru: Larry Wasserman.
    See his article Rise of the Machines:
    Or the lecture notes from his course Statistical Machine Learning.
    It is worth downloading the book Past, Present and Future of Statistical Science,
    where the article above appears, and also seeing Grace Wahba's article related to this one.

  2. The history of science has several examples of fields of knowledge "captured" by others; just ask Economics, which, within the Social Sciences, set out to "swallow" the Humanities, or consider the entry of physicists into Biology after the War. Well, if History has anything to tell us about this, it is the following: Biology did not end, and neither did the Humanities, but their landscapes were never the same again.