|Data Science News|
|University Data Science News|
Leslie Mitchell at NYU Langone Medical Center is building a synthetic genome from scratch. "We are no longer limited to the study of cells that are a product of evolution," she explains. Geneticists' ability to "edit mammalian systems" is daunting. "It is probably naive to rely on altruism. Even the best intentions can go awry. Technical limitations will only impede progress on building increasingly complex genetic systems for so long. I'm an advocate for total transparency...and an inclusive approach." Indeed, her team includes labs around the world.
Yuan Ji, Oded Rozenbaum and Kyle Welch of George Washington University School of Business scraped "1,112,476 employee ratings of 14,282 public firms in the period 2008-2015" from Glassdoor and found that employee ratings are a good predictor of SEC fraud violations. They hypothesize that when firms are under pressure, managers may pressure employees to meet targets, resulting in grumpy employees who write negative comments on Glassdoor. If the managers can't squeeze enough value out of employees, they may try misreporting or other creative fraudulent behaviors to meet the firm's goals.
Peter Szolovits at MIT CSAIL, a leading expert in using natural language processing in precision medicine applications, explained his goal. It's not "trying to get all doctors to 'work as well as the best'." Instead, it is much better to get "the least-skilled doctors [to perform like] average" doctors. In a refreshingly nonchalant dismissal of IoT for healthcare practice, he reminded everyone that, "IoT today is full of security holes" and not at all "ready for prime time." Then he captured this organizational sociologist's heart by declaring that the unrealized potential of electronic health records is not technical, it is "institutional and policy-based."
Arizona State University, the University of Houston, and the NSF have partnered to create an industry-academia research hub for neurotechnology called The BRAIN Center. The research will focus on improving patient outcomes for those with injuries to or degeneration of the central nervous system.
A Stanford University team has launched DAWN, a project to "democratize AI and machine learning." Within the next five years they aim to "build out the toolbox that we believe will empower the 99.9 percent to build and deploy their own world-class data products, quickly and cheaply."
Hahrie Han, a political scientist at UC-Santa Barbara, explains the March for Science. She notes that having disparate goals "makes it harder to translate whatever happens in the march to political influence. And related to my points about centralization or decentralization, one of the challenges is what happens to the coalition afterwards? If they’re too disparate or fragmented, it could be harder to coalesce around shared goals." Han goes on to point out that "the thing that is most predictive of whether any pressure group is able to achieve its political goals is the extent to which it has relationships with political elites." Charismatic scientists among us, this is the call to use that trait for the betterment of science by persuading as many politicians and voters as you can that science is worth funding.
The Gordon and Betty Moore Foundation will now require all grantees to make their grant-funded publications "openly available within 12 months of publication, either on the journal’s site or in an open access repository." They allow grant funds to cover the cost of fees associated with open access publishing. Great decision, and it follows the lead of the Bill and Melinda Gates Foundation. I wonder what the conversation was like around the decision to cover what many consider to be rather ridiculous OA publishing fees. A related fee is the cost to publish data to an open data archive like Dryad, whose organizers found that 96 percent of their users do not budget for data publication fees. One-quarter of their respondents paid the publication fees on their personal credit cards and were not reimbursed.
Elsewhere in publishing problems, PubMed is now publishing funding information in its abstracts to make potential conflicts of interest more obvious.
Duke, Stanford, and Verily (an Alphabet company) have announced the first initiative of Project Baseline, which is to recruit 10,000 participants and track their detailed health data over at least four years. The amount of data collected is extensive, consisting of: "repeat clinical visits; daily use of a wrist-worn investigational device and other sensors; and regular participation in interactive surveys and polls by using a smartphone, computer or call center." This is similar to the Kavli Foundation's HUMAN Project, which is studying the lives of 10,000 New Yorkers 13 or older for "decades." Both projects seem ethically dubious, though the Duke+Stanford+Verily initiative avoids working with minors, limits participants' surveillance period to four years, and is planning to have participant feedback on conference calls throughout the term of the study.
|Company Data Science News|
Google has announced it will carry a built-in adblocker for the Chrome browser. The thinking is that Google won't have its users running off to get third-party ad blockers, some of which demand payments from ad placement companies (like Google) to whitelist their ads. Watchdog groups and adtech competitors are concerned that this type of consolidation of power between the ad placing side of Google and the browser producing side of Google raises the baronesque specter of greedy, arrogant, monopolistic dominance. Simmer down, now, simmmmer down: Google's motto is don't be evil.
Theranos isn't exactly data science news, but one of my dirty secrets is that I enjoy rubbernecking wannabe unicorns as they turn out to be stubborn mules. Allegations filed against it by one of its hedge fund investors claim that it "misled company directors about its laboratory-testing practices, used a shell company to 'secretly' buy commercial-lab equipment, and improperly created rosy financial projections for investors" and ran "fake demonstration tests" of its blood testing product (Wall Street Journal, 2017). If you're going to lie, might as well cheat and steal, too. One of the reasons to go to an actual university - Elizabeth Holmes dropped out of Stanford - is to become well-rounded, maybe take an ethics class or two.
Comcast has promised not to sell consumers' internet traffic data, even though federal rules will allow ISPs to do so. Professor Kevin Werbach at Wharton explains that in such a rapidly changing industry, "no one can say definitively we're sure what will happen....[W]e’ve seen time and time again with technology and privacy that companies keep coming up with new business models and new practices that weren’t anticipated before."
In a profile by CNBC, Yann LeCun, head of Facebook AI and professor at NYU, describes how the Deep Learning Conspiracy, a small group consisting of LeCun, Yoshua Bengio, and Geoff Hinton, incubated deep neural network models during the AI Winter.
Washington, DC is the top-ranked city for women in tech. Scores were based on: the gender pay gap (women in tech make 94.8 percent of what men make in DC); income after housing costs ($56k); women as a percent of the tech work force (41 percent); and the 4-year tech employment growth (17 percent). Silicon Valley did not crack the top ten. New York was 7th.
The AI talent wars are raging. Amazon is projected to spend $227.8m to hire new employees with machine learning skills. "Microsoft Research head Peter Lee compared recruiting AI talent in the field of deep learning to recruiting a top NFL quarterback." As we've mentioned before, Apple should be expected to do well, but is flailing, a problem blamed on its secrecy and siloed organizational culture.
Bose headphones may be capturing data about what users listen to and sending it to third parties via the Bose Connect app. Kyle Zak is suing the company for violating the Wiretap Act. A Bose spokesperson called the charges "inflammatory, misleading."
Google is making its voice recognition technology available to its cloud customers. The use cases are tasks like transcription, voice commands, and integration with other software for foreign language translation.
Descartes Labs, a start-up in Los Alamos, uses satellite imagery and AI to predict food supplies and crisis level food shortages months in advance. This leaves enough time to mount orderly humanitarian responses or optimize food supply networks. This is just one application for their atlas, which could enable a range of powerful land use predictions.
Planet, another company with a wealth of satellite imagery, is hosting a Kaggle competition to develop machine learning for forestry applications (1st = $30k; 2nd = $20k; 3rd = $10k). The goal is to monitor deforestation, agricultural changes, and illegal mining. Once these changes can be accurately identified, governments can react to illegal activity and slow the rate at which global forests are lost and damaged.
|Government Data Science News|
Pittsburgh and Boston are racing to become the key hubs for robotics research and development. The Department of Defense is housing its advanced robotics institute at Carnegie Mellon University, which has long been home to leading robotics researchers. In Boston, Northeastern University's new Interdisciplinary Science and Engineering Complex features a major robotics research center.
US Representative Derek Kilmer (D-Washington) introduced the Open Government Data Act which calls for the federal government to share machine readable data by default, with exceptions for personal and security-sensitive data. A new MIT-CMU report about the impact of AI on the US job market concluded that in order to forecast changes in specific industries we need a new federal data gathering initiative.
Steve Ballmer launched USAFacts, a non-profit project to display US government statistics. So far, the website is mostly sparklines and large-font single numbers: median age in America: 37.8! He also produced a 2017 10-K pdf for the US which offers a Ballmer-esque perspective on the intersection of business and civic institutions. Note to Steve: open government advocates are not huge fans of pdfs.
The FCC auctioned off 175 radio wave frequencies, transferring the balance from old broadcast television stations to newer broadcast TV stations and internet service providers. T-Mobile, the only mobile phone provider to make significant investments, spent $8b.
Dawn Tilbury has been appointed Assistant Director for Engineering at NSF. Tilbury is a mechanical engineering professor at the University of Michigan working on mobile robotics and passionate about mentoring junior faculty in STEM fields, especially junior women faculty. I'm glad to see that in federal science leadership.
Mattel will develop Aristotle, an intelligent robot for children that is designed to answer their questions, nurture them, and remain age appropriate as they grow. The toy maker was planning a partnership with Amazon's Alexa, but has dropped that option in favor of an as-yet unknown partner. Mattel is opening the platform to outside developers, one of which may have thousands of books loaded into the device, accompanied by images it can project on the ceiling to enable digital bedtime stories. Protecting child users' privacy is part of the discussion; details are skimpy.
Bickering with your partner? There's an app for that! Researchers in the Couple Mobile Sensing Project (what a name) at USC got 34 couples to have their speech and GPS coordinates captured while wearables measured their skin conductivity, physical activity, and body temperature. Nineteen of the thirty-four couples reported having a conflict during the one-day collection period. (I'm sensing that one member of these couples may have been more into this research project than the other, manufacturing conflicts that may not otherwise have occurred. Ah, what we do in the name of science.) The machine learning process accurately predicted the conflicts 79.3 percent of the time. The hope is that these predictions can allow apps on the phone to offer 'helpful suggestions' to ameliorate conflict. Adding two smart aleck phones to a bickering couple does not immediately strike me as a good idea.
Johan van der Beek used AI to optimize the board game Monopoly, adjusting the rent and fees on some properties to ensure every player has an equal chance of winning. Van der Beek raised the payout on Water Works from 4x what's shown on the dice to 7x. What if the player also owns the Electric Company? The fee goes from 10x to 17x.
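For readers who want to see what those multipliers mean at the table, here is a minimal sketch of the utility rent arithmetic. The standard 4x and 10x values are the usual Monopoly rules and the 7x and 17x values are the ones reported above; the function and dictionary names are hypothetical, and the sketch assumes the adjusted multipliers apply the same way to either utility, which the item above does not confirm.

```python
# Utility rent in Monopoly is a multiplier times the total shown on the dice.
# Keys are the number of utilities the owner holds; values are the multiplier.
STANDARD_MULTIPLIERS = {1: 4, 2: 10}   # standard rules
ADJUSTED_MULTIPLIERS = {1: 7, 2: 17}   # values reported for van der Beek's version


def utility_rent(dice_total, utilities_owned, multipliers):
    """Return the rent owed for landing on a utility (hypothetical helper)."""
    if not 2 <= dice_total <= 12:
        raise ValueError("two six-sided dice total between 2 and 12")
    return multipliers[utilities_owned] * dice_total


# Example: a roll of 7 against an owner holding one utility, then both.
for owned in (1, 2):
    before = utility_rent(7, owned, STANDARD_MULTIPLIERS)
    after = utility_rent(7, owned, ADJUSTED_MULTIPLIERS)
    print(f"utilities owned: {owned}, standard rent: ${before}, adjusted rent: ${after}")
```

On a roll of 7, the rent for a single utility rises from $28 to $49, and for both utilities from $70 to $119.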
Using a remote sensing technique and historic elephant census data, photographer Morgan Trimble and Ashley Robson predicted that 75 percent of Africa's elephants are 'missing' due to poaching.
Tahany Als recorded students' computer use in her Earth 222/Environ 232 course at the University of Michigan. She then created a slide listing all of the things they do online during class. Some activity is predictable - Facebook, NYTimes, shopping, their programming homework. Other activity is less predictable: photoshopping President Trump's head onto muppets, breaking up with a boyfriend over chat (not recommended).
The US Census released a descriptive page-turner about the lives of young people: "The Changing Economics and Demographics of Young Adulthood From 1975 to 2016". [Netflix also has a controversial new original show that could be said to be about the changing lives of young people: 13 Reasons Why.] A few quick highlights from the Census report:
"over half of Americans believe that marrying and having children are not very important in order to become an adult"
"In the 1970s, 8 in 10 people married by the time they turned 30. Today, not until the age of 45 have 8 in 10 people married"
"More young people today live in their parents' home than in any other arrangement: 1 in 3 young people, or ~24 m 18- to 34-year olds, lived in their parents' home in 2015."
"In 1975, 25 percent of young men ages 25-34 had incomes of less than $30k per year. By 2016, that share rose to 41 percent (incomes are in 2015 dollars)."
Kevin Ho of IDEO created an interactive font map that is clustered using a convolutional neural network. Ho writes that, "choosing a font is one of the most common visual decisions a designer makes" but without an easy to navigate catalogue they typically, "fall back on fonts they’ve used before or search within categories like serif, san-serif, or grotesque." In this case, the application of computation could lead to more creativity, not less, if we define creativity as building ideas in areas outside one's status quo.
|Data Visualization of the Week|
|The Pudding, Russell Goldenberg from April 18, 2017|
|Tweet of the Week|
|Twitter, XKCD from April 20, 2017|
|Cloudera Government Forum |
Washington, DC Tuesday, April 25, at The Newseum [registration required]
|Moore-Sloan Data Science Lunch Seminar Series |
New York, NY Wednesday, April 26, Ciro Cattuto from Fondazione ISI, 1:30 - 2:30 p.m. at the NYU Center for Data Science, 60 5th Avenue, 7th Floor. Lunch provided. [free]
|Georgia Tech Cyber Security Town Hall: New Standards for Controlled Unclassified Information |
Atlanta, GA Wednesday, April 26, starting at 10 a.m. in the Student Center Theatre. ... Failure to adhere to security standards could lead to the loss of funding for contracts and other projects. In order to protect this funding, Georgia Tech Cyber Security is developing a strategy to help our community comply with these standards and is looking to engage the campus in this discussion. [free]
|Databite No. 98: Eric Horvitz |
New York, NY April 26 at 4 p.m., Data & Society, 36 West 20th Street, 11th Floor [rsvp required]
|Text as Data Speaker Series |
New York, NY 4-5:30 pm, Thursday, April 27, Brendan T. O’Connor (UMass Amherst), 60 Fifth Ave (7th Floor). [free]
|Join us for the launch of the SAP Next-Gen program in New York in partnership with Hasso Plattner Institute |
New York, NY April 27 at 10 Hudson Yards [free, waiting list only]
|the MaD Seminar |
New York, NY April 27 at 2:30 p.m., Joel Tropp from Caltech hosted by the NYU Center for Data Science, location: 12 Waverly Pl, L120. [free]
|NYU Computer Science Department Seminar |
New York, NY April 28, at 9:30 a.m., Columbia University, Schapiro (CEPSR) Building, Davis Auditorium (Room 412), Speaker: New York Area Theory Day with speakers from IBM/NYU/Columbia [free]
|National Transportation Data Challenge: Launch Event |
Seattle, WA May 2-3. Participants include cloud computing leaders, nonprofit organizations, entrepreneurs, and the Chief Data Officer of the U.S. Department of Transportation. [$$]
|CITP Conference: Ethics of Computer Science Research |
Princeton, NJ Friday, May 5, starting at 9 a.m., Princeton University (Frist Campus Center, Multi-Purpose Rooms B & C) [please RSVP]
New York, NY FinClusion is a hackathon weekend that brings together individuals from various backgrounds to find a solution to why there is a lack of inclusion in Financial Technology. Friday, May 5. [$$]
|!!Con 2017 |
New York, NY !!Con (pronounced “bang bang con”) is two days of ten-minute talks (with lots of breaks, of course!) on May 6-7. !!Con is a pay-what-you-want conference.
|Columbia University Causal Inference Conference: Varying treatment effects |
New York, NY Saturday, May 6 [free, sold out]
|SlatorCon London |
London, England May 9 starting at 2 p.m., Ace Hotel London Shoreditch [$$$]
|Inclusive AI: Technology and Policy for a Diverse Urban Future |
Berkeley, CA May 10 starting at 10:30 a.m., organized by CITRIS and the Banatao Institute [$$]
|JupyterDay Philly |
Philadelphia, PA Thursday, May 18-19 at Bryn Mawr College. [$$]
|Big Data Finance Conference 2017 |
New York, NY May 19 starting at 8 a.m., NYU Center for Data Science [$$$]
|Machine Learning in Healthcare: Industry Applications |
Boston, MA May 24 at Merck Research Laboratories. [Invitation Only]
New York, NY The World of Cloud Computing All in One Place! Cloud Computing - Internet of Things - Big Data | Analytics - FinTech - DevOps - Containers - Microservices ... June 6-8 [$$$$]
|Save the date for South Big Data Hub All Hands meeting | Hubbub! |
Chevy Chase, DC June 9 at Microsoft’s Chevy Chase Pavilion. [free to SBDH members]
|ACM SIGHPC / Intel Computational & Data Science Fellowship|
Specifically targeted at women or students from racial/ethnic backgrounds that have not traditionally participated in the computing field, the program is open to students pursuing degrees at institutions anywhere in the world. Deadline for nominations is April 30.
|Nominations - Society for Political Methodology Statistical Software Award|
The award recognizes "individual(s) for developing statistical software that makes a significant research contribution." Deadline for nominations is May 12.
|Mozilla Fellows for Science - 2017 |
We're looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Call closes on May 14.
|Call for the 2017 Next Generation Data Scientist (NGDS) Award|
The Steering Committee of the IEEE International Conference on Data Science and Advanced Analytics decided to launch the prestigious award: Next Generation Data Scientist Awards, to address this gap and encourage young talents to conduct foundational research and applied innovation work in Data Science and Analytics. Deadline for award applications is May 25.
|FT Future of Fintech Awards|
The awards recognise and reward companies able to demonstrate innovative ideas capable of creating lasting change in the financial services sector, on a global scale. Deadline for submissions is June 4.
|The Distill Prize for Clarity in Machine Learning|
Beginning in 2018, Distill prizes will be given annually for work done before January 1 of that year. The number given each year depends on the amount of outstanding work done. We aim to come to decisions by the end of February.
|Cascadia R Conf|
Portland, OR Conference is June 3. Deadline to submit a talk is April 24.
|Panels - SC17|
Denver, CO SC17 is The International Conference for High Performance Computing, Networking, Storage and Analysis, November 12-17. Deadline for panel submissions is April 24.
|Proposals | Dataverse Community Meeting 2017|
Cambridge, MA June 14-16 at Harvard University. Deadline for proposals is April 25.
|Computational Creativity & Games Workshop|
Atlanta, GA June 19, an ICCC'17 Workshop. Deadline for paper submissions is April 25.
|10th Workshop on Hot Topics in Privacy Enhancing Technologies (HotPETs 2017)|
Minneapolis, MN Held in conjunction with the 17th Privacy Enhancing Technologies Symposium on July 21. Deadline for submissions is May 8.
|Call for Papers: Special Issue on Computational Propaganda and Political Big Data |
A special issue of the journal Big Data will be dedicated to computational propaganda, guest edited by Phil Howard and Gillian Bolsover. The deadline for submission is 1 June, 2017 for publication in December 2017.
|Monarq Incubator Founder Application|
Are you an early stage women-led company in the NYC area with a big vision? ... Applications will close on May 5.
|Google seeking input on next directions in CS Education Research|
Feel free to share this survey with others who may be interested in sharing their insights.
|Audible Metadata Prototyping Project |
NYC Media Lab is seeking a NYC-based university data science faculty member who can lead a group of 2-4 graduate students to complete a software engineering/data science project with Audible over the summer in 2017. The project budget is $25,000. The project will focus on experimenting with metadata extraction from a book’s manuscript.
|Call for Proposals, Health Data for Action|
The Robert Wood Johnson Foundation HD4A program will fund innovative research that uses the available data to answer important research questions. Applicants under this Call for Proposals (CFP) will write a proposal for a research study using data from either the Health Care Cost Institute or athenahealth. Successful applicants will be provided with access to these data, which are described in greater detail below. Deadline is May 24.
|NYU Center for Data Science News|
|Creating A Multi-Genre Corpus for Natural Language Inference|
|NYU Center for Data Science from April 14, 2017|
Although natural language processing (NLP) has made major strides in the last few years, to what extent can an NLP algorithm understand human sentences beyond a superficial read? Although they can computationally identify, count, or regurgitate individual words, phrases, and sentences, can they capture the meaning behind the words that they are handling?
These questions are at the heart of a fledgling sub-field within NLP called Natural Language Inference (NLI), where CDS professor Sam Bowman’s work is currently located.
|Tools & Resources|
|the bioinformatics chat |
|Roman Cheplyaka from April 16, 2017|
"The bioinformatics chat is a podcast about computational biology, bioinformatics, and next generation sequencing." ... "In this episode Mingfu Shao talks about Scallop, an accurate reference-based transcript assembler."
|An R package containing all baby names data from the SSA|
|GitHub - hadley from April 14, 2017|
"This package contains three datasets provided by the USA social security administration: babynames, applicants, lifetables."
|A Dramatic Tour through Python’s Data Visualization Landscape (including ggpy and Altair)|
|yhat, Dan Saber from April 19, 2017|
"Indeed, I was so impressed by Altair that the original thesis of my post was going to be: 'Yo, use Altair.'"
"But then I began ruminating on my own Pythonic visualization habits, and — in a painful moment of self-reflection — realized I’m all over the place."
|Release of IPython 6.0|
|Project Jupyter from April 19, 2017|
"It is with great pleasure that today we released IPython 6.0 — almost a year after the 5.0 version. Users on Python 3.3 and above can get this latest version with all its new features by asking your package manager to upgrade IPython."
|A Large Self-Annotated Corpus for Sarcasm|
|arXiv, Computer Science > Computation and Language; Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli from April 19, 2017|
"We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in regimes of both balanced and unbalanced labels."
|Teaching Statistics: A Bag of Tricks (second edition)|
|Andrew Gelman, Statistical Modeling, Causal Inference, and Social Science blog from April 20, 2017|
"Hey! Deb Nolan and I finished the second edition of our book, Teaching Statistics: A Bag of Tricks."
|d3.annotation: Design & Modes|
|Susie Lu from April 20, 2017|
"I started this library by gathering examples of annotations that I liked. From those examples, the majority of use cases followed a pattern: a subject (the thing the annotation is annotating), a note, and a connector joining the note to the subject."
|Release 'open' data from their PDF prisons using tabulizer|
|rOpenSci, Thomas J. Leeper from April 18, 2017|
"There is no problem in science quite as frustrating as other peoples' data. Whether it's malformed spreadsheets, disorganized documents, proprietary file formats, data without metadata, or any other data scenario created by someone else, scientists have taken to Twitter to complain about it. As a political scientist who regularly encounters so-called "open data" in PDFs, this problem is particularly irritating."
|Announcing Datazar v2.0|
|Datazar Blog, Aman Tsegai from April 19, 2017|
"Now you can analyze any dataset using R and Python with the notebook or console interfaces right in your browser. All the computation is done on Datazar’s servers so you can literally do it using a Chromebook."
|Tenured and tenure track faculty positions|
Stanford University School of Medicine; Palo Alto, CA
|Full-time, non-tenured academic positions|
|Research Associate – Blockchain technology for Algorithmic Regulation and Compliance|
University College London; London, England
|Research Fellow (Data Science/Biomedical Engineering)|
Trinity College Dublin; Dublin, Ireland
University of Washington, Center for Collaborative Systems for Security, Safety, and Regional Resilience; Seattle, WA
|Postdoctoral Positions in the SNAP Group Machine Learning and Bioinformatics|
Stanford University; Palo Alto, CA
|Postdoctoral Research Associate, Princeton Neuroscience Institute|
Princeton University; Princeton, NJ
|Computational Biomechanics Post-Doc|
University of Virginia, Center for Applied Biomechanics; Charlottesville, VA
|Postdoctoral Associate | The IECA|
Stony Brook University, Alan Alda Center for Communicating Science; Stony Brook, NY
|Elevate 2-year postdoctoral fellowship|
Mitacs; Multiple Locations in Canada
|Postdoctoral Research Fellow for the “Responsible Terrorism Coverage”|
University of Mannheim, Mannheim Center for European Social Research; Mannheim, Germany
|Full-time positions outside academia|
Cloudera; San Francisco, CA
|Head of Data Engineering |
MassMutual; Boston, MA
Partnership on Artificial Intelligence to Benefit People and Society
Google DeepMind; London, England
Nokia Bell Labs; Cambridge, England
|Clinical Data Architect|
Science 37; Playa Vista, CA
|Principal Data Scientist|
Microsoft, The Decision Service; New York, NY, or Redmond, WA
|Documentation Writer for Open Data Kit|
Nafundi; Seattle, WA
|Research Scientist (Machine Learning)|
NVIDIA; Santa Clara, CA and other locations
|Lead Data Scientist|
iam Bank; Chicago, IL
|Database Kernel Engineer|
Splice Machine; St. Louis, MO, or San Francisco, CA
|Senior Backend Engineer|
Chartbeat; New York, NY
|Senior Applied Scientist|
Microsoft, Windows and Devices Group; Redmond, WA
|Senior Analyst (2)|
NYC Department of Housing Preservation and Development; New York, NY
|Data Scientist – Higher Education Analytics|
HelioCampus; Fairfax, VA
|NEON Observatory Director/Chief Scientist |
Battelle; Boulder, CO
|Internships and other temporary positions|
|Project Manager - The AI Index|
Stanford University; Palo Alto, CA
|Social Media intern|
Rutgers University, School of Communication & Information; New Brunswick, NJ
| College Intern - Data Science - Innovation & Performance |
City of Pittsburgh; Pittsburgh, PA
|Machine Learning/MIR Research Intern|
Sunhouse; Long Island City, NY