［专访］用大数据解放科学家,学术更简单

大数据文摘

发布于 2018-05-21 11:36:26

7680

大数据文摘原创作品，欢迎个人转发朋友圈；其他机构、自媒体转载，务必后台留言，申请授权。

采访｜刘小娇孙强阚玺

撰稿｜孙强阚玺魏子敏郭曼桐

本期原点第一次尝试了全英采访，小编们做了很充分的准备，并且邀请了医疗专栏总编、现任职美国国立癌症研究所的孙强老师加盟。本着原创的精神也尽量以Q&A的形式，完整的用英文将这期稿子呈现给大家，希望给您带来第一手的阅读体验。原点致力于采访到国内外最优秀的创业团队并跟读者们分享最精彩的创业故事，也欢迎您向我们推荐海外优秀创业公司或是您想了解的创业故事。中英文法文采访撰稿小编们都愿意尝试，请尽情推荐自荐！所有采访均为无偿行为，只为提供更好的阅读体验！

毕业季不仅有聚会、合照、啤酒、大哭，还有让所有人抓狂的毕业论文。而毕业论文中的文献综述工作是让小编印象最深刻的恐怖经历。相信很多读者也一样经历过寻找、逐一阅读并整合一个领域成百上千篇论文的经历。本期原点的创业嘉宾也一样，在完成博士论文期间萌生了用计算机和算法代替学者们完成这一枯燥工作的想法。这一想法从诞生到初步完成花费了五年时间，现在，生物医学领域的学者们已经基本可以利用这一工具追溯一个领域的研究历史及重要论文了。

Sam Molyneux给自己的这个网站取名位Sciencescape，希望能够建立起一种逃离传统学术搜索类网站的用户体验，他给自己网站的一个定义是“twitter-like“，从外观和使用感受上都想给读者一种全新的学术体验。

大数据文摘原点今日有幸跟大家分享对Sciencescape创始人Sam Molyneux的访谈，带您一起了解这位医学家的创业经历，看看他如何在做学术的同时赚取自己的一桶金。

观看以下视频先看看Sciencescape在做什么（02:12）

英语阅读训练马上开始

Are You Ready?

Q & A

1. Canyou briefly introduce Sciencescape? Whatare the factors/motivations that drive you to set up Sciencescape?

Briefly,we set out to build Sciencescape which is a place on the web that connects youto breaking research as happens around the world, and also gives you somepowerful tool to explore the entire history of Biomedicine.

The motivations for building Sciencescape, were a number of experiences that I hadas a cancer genomics researcher, through a PhD that I was working on a numberof years ago, essentially with the challenges around how difficult it is tobrowse and explore the history of any fields of research or fields of researchyou were working with, identifying landmark papers, tracing the connectionsamong the papers through history. And then second to that, how challenging itis, more nearly possible to be to actually stream and follow the research in apersonalized manner. Based on that, we looked at the tools were in market atthe time, which were essentially just search engines with email alerts and RSSFeed that you could link to from journal tables of content.

We envision a platform thatyou could walk into and follow anything in your world of research whether it ispeople, genes, diseases, proteins, products, places, affiliations, etc., anyentity, among millions of potential entities in the world of research thatrelate to you and in a “twitter-like” experience, research which stream to you,through the web and through the mobile, and then there will be some powerfulinterfaces for explore history. That is something we envisioned, and 5 yearslater, we finally made some progress with that.

- Technical & Product-Related Questions

2. Based on your short intro video and online information, we’ve learned that Sciencescape leverages the advantages of Natural Language Processing and Content Identification techniques. How to apply those techniques to your platform to provide real-time customized recommendations for users?

A really good call of what we work on Sciencescape beyond the platform and product user experience, the core work we work on is a knowledge graph, which we have been building for almost half decades now. Essentially the knowledge graph concept is a network of linked entities, to which you organize content around and you organize related to each other, applying that concept to the world of biomedical literature. We have set up on this project to be able to in an automatic fashion using machine learning intelligence and identify all of these entities that are related to any paper. So those entities come from essentially informatics classes, some of these I just mentioned. To do that, you have to apply a family of nature language processing with different algorithms, versus genes, versus disease, versus an organism, versus a product, etc., and a number of other machine learning approaches, such as classification algorithms, to do with disambiguating authors or affiliations. We apply many different machine learning approaches to the universal challenge of tagging papers correctly with entities and concept.

3. What kinds of metrics are used to value the paper? Through the numbers of citation? What else? How to evaluate the importance of paper？

A number of years ago, we looked at the metrics field and we concluded that there is a lot of work to do with citation work alone – a lot of work that are not fully productized. We thought to identify a universal metrics that would allow us to organize papers relative to each other, and organize entities relative to each other. We landed on a various of pagerank algorithm, which is well known as the algorithm that google started on. We identify the variant which is called eigenfactor, which is very similar to pagerank, we just applied the citation network. And if you calculate that on a very large an accepted citation network – today I believe we have the 4th largest in the world. If you calculated on that, you will end up with robust metrics at the article level to use any information which is retrievable, search rankings and other listing of the papers. But those metrics also propagates at the level of entity, so you can calculate the institute level eigenfactor, or the author level eigenfactor, or the gene level eigenfactor. So we ended up with a flexible, powerful and robust metric, which propagates through our system and allows much of these experiences to keep figured in list things. A short answer is to use eigenfactor.

That is only where we started, so we thought we identify base metric to build our initial system on and where we are going with is of course a diverse set of metrics, possibly hundreds of metrics, pulled from through the web all the metrics, as well as internal network metrics to be able to improve the experience in a personalized way for users and allow to explicitly rank the papers and entities against. We actually get a lot of progress on that as well.

4. Some of the algorithm, such as page ranks, you just mentioned, are widely applied by other companies in industry as well. Compared to those big article search engines, your competitors, such as google scholar, what are your competitive advantages?

First, we do not spend a lot of time thinking about our competitors, we spent a lot of time thinking about how to build a different transformed experience for streaming and exploring papers through history. I don’t think there are any products, any platforms and any services there that fill that need, which is why is worth passion on it and why we work intensively on this. We think it is a unifying platform – it is not necessarily a distractive platform, because the function is not well filled. So thinking about search, search is an effective way that literature is access. We think there is a great search there, such as google scholars, is the most extensive search on citation. But search is a solved problem for papers you know exist, if you can describe a paper well. If you have fragment of the title, or part of whole text abstract, or a couple of the authors, it is trivial to use searches on that paper. You can use pubmed and google scholar to do this. There is great enterprise search as well. We don’t work on and think about search too much, we intend to have good search on our platform. But we think about streaming and discovering a lot more and we think the best opportunity really is to improve scientific leadership experience globally.

5. Through some other interview to you, we got to know that your company also had some partnership relation with some individuals or organizations, such as publishers. Compared to the product you mentioned before, this is totally a different business model. Can you talk a little bit more on that?

We have been building our partnership with publishers for quite a while. Essentially, we worked with publishers to help increase the discoverability of article on full text bases. We think publisher is really important in terms of on time delivery of weekly or monthly articles that researchers searched for. In the current marketplace, that is essentially not the case. The researchers are unserved and publishers are unserved. So we built a program over the last a couple of years for publishers. We accept their content and mine it in a very deep way to increase the discoverability by applying the family of machine learning intelligence algorithms to teach one of the articles. We are open to publisher industrial and value the relationship that we are working together.

- Operation and Strategy Questions

6. Has the company generated revenue so far? How do you monetize your business model?

We are in Series A stage. We had some fantastic investors; we have a large Canadian group investors, who we built the company with during Series A stage. During the next stage, we have venture capital firms in Canada, Chinese Hong Kong and US. We have 30 people in the company and we are located in a beautiful start-up based downtown Toronto. Our company is dominant by data science and engineering and we just start to build up our product team, in terms of product management, marketing, design. Our company has gone through tremendous transformation internally. A lot of workload will start to expand externally over the next 6 months.

7. What is the biggest challenge your company came across so far? How did you deal with the challenge?

Ourchallenge is pretty similar to any other start-up companies. Fund raising isalways a challenge. We are in very important but specialized market. We are notproviding a consumer-facing application in the sense that we offered to thebroad consumer market. We have consumer-facing application with respect ofresearchers and leaders of research literature. So there is not 1 billionreaders of scientific literature – it is just a small number. But the people whoinvolved in reading the literature are tremendous important from what theycontributed to society and the budget they command in terms of researchdollars, in terms of driving medical and other society progress. Because ofthat specialized nature of market, sometimes it is difficult to persuade peopleto invest in the company. I think at this time, we have achieved a lot and weovercome a lot of those communication limitations.

8. Our big data digest has over 150 thousands followers who are interested or currently working in various big data fields. Is China going to be one of your future markets? How can we help you to make influence using our effective connections and resources?

We see tremendous opportunity to help increase discoverability of we search for Chinese researchers, a market growing really fast. One of the companies we have partnership with, focus on some of these emergent research markets and some of these data shows the growth of these markets is just explosive. So China is definitely a big market for us and India as well. If any opportunities help us to grow our usership through networks, we will be very graceful to work with you.

We have Institutional Ambassador Program – researchers who interested can join our company and work from abroad, to promote their research through our product. We can provide contact info for the readers who are interested in this program.

- Other questions related to market future, startups management and entrepreneurship

9. As a researcher, how do you envision the cancer therapy in future?

I think the program of personalized medicine and Cancer Genomics is a long-term program. So it is going to take many years to elucidate all the long tails of changes in each cancer genome. Each tumor is very different. Cells in the tumor presents lot of differences. heterogenous changes are the most constant thing in cancer genome. But looking into this in another way, despite of the large number of mutations in different cancer genes, all the changes are kind of resolved into a relatively small number of pathways. If we have enough pathway-targeting drugs and we can use them in a combinatorial fashion we can systematically and effectively suppress cancer across multiple pathways. Of course we also need to consider and take care of cancer evolution, but I think this is a productive approach. Also recent cancer immunotherapy results are staggering. Considering the nature of cancer and cancer genome evolution, it is definitely better to have multiple-line of therapies. If we can combine them we can probably suppress cancer in a long run.

10. Do you have any plans to expand your business to clinical informatics? Or other related medical informatics?

I think that is interesting. We have a lot of people who have background of bioinformatics and clinical/medical informatics as well. We are passionate to those approaches. But as a start-up company, we must focus on a main problem, a main market and get really good with something within that space. So we think that the data we are yielding in the process of mining scientific articles to create our knowledge fast can probably be used in precision medicine and probably be used in bioinformatics as another data source to analyze. We will probably focus on providing a transformed data set that can be used by any academics. We are working on for next year the ability for academic to use our data and build on it.

尽管Sam认为Sciencescape尚在初级阶段，还有很多潜力有待深度挖掘，但从其言语间不难听出他为Sciencescape所描绘的宏伟蓝图。他希望用户在文献搜索时，能通过Sciencescape徜徉在生物医学的知识世界，更高效地随手触及从古至今的文献资料。

然而，不光是Sciencescape界面的用户体验还是目前的文献数量都还不甚完善，要想成为与Google scholar相媲美、并有自己独特优势的学术文献搜索工具，Sciencescape还有很长的路要走。

面对潜力巨大的中国市场，Sam明确表达了他想要接触并合作的渴望。但至于能否像Wikipedia一样准确进行非英语语言的翻译工作，Sam只是表达了愿望，却并未将这项工作列入近期的议事日程。

也许对于Sciencescape这样一家创业公司来说，能够“标新立异”可以算是一个很好的开端，但未来能走多远、走多广，还要看其对于市场的把握以及对于主体业务的探索。

大数据文摘原点栏目

“原点”

坐标中的定位点、起点，万事开头难，但只要起步，一切皆有可能。2015年初, 大数据文摘“原点”栏目成立。这是针对大数据初创公司的采访栏目。通过在线采访的方式，对与大数据相关的初创团队进行采访，介绍项目、技术、商业模式。初期，我们的采访对象是美国等发达国家的大数据相关的初创企业，他们一般已经获得天使或A轮投资。

我们希望通过“原点”，为读者打开一扇门，看到国外“大数据”初创公司是如何启动、运营的，看到这些创业公司后面的人、团队有着怎样一种情怀。同时我们也会真诚帮助那些愿意在“大数据文摘”平台接受采访、分享创业经验的企业。只要你的公司有价值，愿意与读者分享。我们就愿意免费采访您。

大数据文摘从2013年7月成立至今，每日坚持分享优质文章，从未间断。已有读者超过13万，是微信上最有影响力的大数据自媒体。原点栏目组由在北美各大应用领域（医疗，电商，咨询等）从事与大数据相关工作的具有高学历及数据专业背景的成员组成。

我们欢迎更多乐于分享的创业公司与我们邮件联系，邮件中请附上您的公司简介及相关诉求，邮件发送至origin@bigdatadigest.cn

原点栏目成员

有意联系栏目组成员的朋友，请给“大数据文摘”后台留言，附自我介绍及微信ID，谢谢。

－－－－阚玺－－－刘小娇－－－－魏子敏

－－－于丽君－－－郭曼桐－－－闫瑾

阚小玺

原点栏目主编。美国伦斯勒理工学院决策科学专业博士学位。现任RetailMeNot, Inc.的资深决策分析师，负责运用统计测试及最优化分析提供产品方案的决策建议。希望通过原点平台与致力于大数据领域创业的朋友互相交流学习。

刘小娇

盖洛普咨询高级统计分析师及品牌战略咨询师、《阿里商业评论》特约撰稿人。现居美国华盛顿特区。任盖洛普多项全球调查及美国居民幸福指数实时调查首席分析师。合作撰文内容涵盖政治、经济、国际金融、健康等领域。从事各项数据分析及可视化工作。如有对移动医疗领域、统计建模、大数据应用及可视化感兴趣，有志创业的同仁，请给她留言。

魏子敏

中国香港中文大学新闻学硕士。硕士期间利用大数据分析媒体报道的政治偏向，2012年美国大选期间参与美国加州民主党竞选工作，搜集选民数据并统计数据反映的政治走向。希望使用更科学的数据做更好的新闻，望与更多志同道合的朋友共同交流进步。

于丽君

本科硕士毕业于清华大学数学系，硕士研究课题为图像修补问题建模，目前为美国Case Western Reserve University应用数学在读博士，研究方向为贝叶斯方法反问题建模，博士研究课题为利用MEG（脑磁成像技术）时序信号对大脑活动进行定位，对数学建模、机器学习、人工智能以及图像处理等方面有广泛兴趣，希望可以通过原点栏目组，结识更多相关领域的朋友以及有志于创业的同仁，互相交流进步。

郭曼桐

清华新闻与传播学院2005级本科，毕业后在新华社工作，主要参与CNC World英语台的电视节目制作。曾被新华社派驻至美国华盛顿任驻外记者。曾参与美国大选、美国政府关门风波、IMF和World Bank年会等报道。

闫小瑾

原点栏目采访联络人, 04年兰州市高考理科状元，清华数学本科，美国马里兰大学数理统计博士。博士论文研究眼动数据(eye tracking)在医疗、市场营销方面的应用；同时关注HIV药品的临床数据分析。目前在美国某大型银行做量化分析，构建房地产贷款的信用风险预测模型(credit risk model)。期待通过原点认识更多有志于大数据相关产业创业的朋友。同时现正组建自己的团队，为顾客提供大数据咨询服务(分析，建模，解决方案，模型测试等)，有项目需求者请联系: smallbear@bigdatadigest.cn.

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2015-07-04，如有侵权请联系 cloudcommunity@tencent.com 删除

大数据