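Interview | Liu Xiaojiao (刘小娇), Sun Qiang (孙强), Kan Xi (阚玺)
Written by | Sun Qiang (孙强), Kan Xi (阚玺), Wei Zimin (魏子敏), Guo Mantong (郭曼桐)
For this issue, Yuandian (原点) tried an all-English interview for the first time. The editors prepared thoroughly and invited Sun Qiang, chief editor of the medical column, who currently works at the US National Cancer Institute, to join. In the spirit of original reporting, we present this piece entirely in English and, as far as possible, in Q&A form, hoping to give you a first-hand reading experience. Yuandian is committed to interviewing the best startup teams in China and abroad and to sharing the most compelling startup stories with our readers. You are welcome to recommend outstanding overseas startups or startup stories you would like to know more about. Our editors are willing to try interviewing and writing in Chinese, English, or French, so please feel free to recommend others or yourself! All interviews are conducted free of charge, solely to provide a better reading experience!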
Are You Ready?
Q & A
1. Can you briefly introduce Sciencescape? What are the factors/motivations that drove you to set up Sciencescape?
Briefly, we set out to build Sciencescape, a place on the web that connects you to breaking research as it happens around the world and also gives you some powerful tools to explore the entire history of biomedicine.
The motivations for building Sciencescape were a number of experiences I had as a cancer genomics researcher, during a PhD I was working on a number of years ago. First, there were the challenges around how difficult it is to browse and explore the history of any field of research you are working in: identifying landmark papers and tracing the connections among papers through history. Second, there was how challenging, nearly impossible, it is to actually stream and follow research in a personalized manner. Based on that, we looked at the tools that were on the market at the time, which were essentially just search engines with email alerts and RSS feeds that you could link to from journal tables of contents.
We envisioned a platform that you could walk into and follow anything in your world of research, whether it is people, genes, diseases, proteins, products, places, affiliations, etc.: any entity, among millions of potential entities in the world of research, that relates to you. In a "Twitter-like" experience, research would stream to you through the web and through mobile, and there would be some powerful interfaces for exploring history. That is what we envisioned, and 5 years later we have finally made some progress on it.
- Technical & Product-Related Questions
2. Based on your short intro video and online information, we've learned that Sciencescape leverages Natural Language Processing and Content Identification techniques. How do you apply those techniques on your platform to provide real-time customized recommendations for users?
A really large part of what we work on at Sciencescape, beyond the platform and the product user experience, is a knowledge graph, which we have been building for almost half a decade now. Essentially, a knowledge graph is a network of linked entities around which you organize content and which you relate to each other. We apply that concept to the world of biomedical literature. We set out on this project to be able to identify, automatically and using machine learning, all of the entities that are related to any paper. Those entities come from essentially informatics classes, some of which I just mentioned. To do that, you have to apply a family of natural language processing approaches, with different algorithms for genes versus diseases versus organisms versus products, etc., and a number of other machine learning approaches, such as classification algorithms for disambiguating authors or affiliations. We apply many different machine learning approaches to the universal challenge of tagging papers correctly with entities and concepts.
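As a rough illustration of the idea in this answer (tagging papers with entities from several classes and linking them into a graph), here is a minimal Python sketch. It is not Sciencescape's method: it swaps the machine learning taggers described above for a plain dictionary lookup, and the lexicon, entity IDs, sample abstracts, and function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: entity class -> {surface form -> canonical entity ID}.
# A real system would use a trained tagger per class; this lookup is a stand-in.
LEXICON = {
    "gene": {"brca1": "gene:BRCA1", "tp53": "gene:TP53", "p53": "gene:TP53"},
    "disease": {"breast cancer": "disease:breast_cancer"},
    "organism": {"mouse": "organism:Mus_musculus", "mice": "organism:Mus_musculus"},
}


def tag_entities(text):
    """Return the set of canonical entity IDs whose surface forms occur in text."""
    lowered = text.lower()
    found = set()
    for surface_forms in LEXICON.values():
        for surface, entity_id in surface_forms.items():
            # Whole-token match, so "p53" does not fire inside "tp53".
            pattern = r"(?<![a-z0-9])" + re.escape(surface) + r"(?![a-z0-9])"
            if re.search(pattern, lowered):
                found.add(entity_id)
    return found


def build_graph(papers):
    """Build an index from entity ID to the set of paper IDs tagged with it."""
    entity_to_papers = defaultdict(set)
    for paper_id, abstract in papers.items():
        for entity_id in tag_entities(abstract):
            entity_to_papers[entity_id].add(paper_id)
    return entity_to_papers


def related_entities(graph, entity_id):
    """Count how many papers each other entity shares with entity_id."""
    counts = defaultdict(int)
    for paper_id in graph.get(entity_id, ()):
        for other, paper_ids in graph.items():
            if other != entity_id and paper_id in paper_ids:
                counts[other] += 1
    return dict(counts)


# Hypothetical sample abstracts, for illustration only.
PAPERS = {
    "p1": "BRCA1 mutations and breast cancer risk.",
    "p2": "Loss of p53 accelerates tumours in mice.",
    "p3": "TP53 and BRCA1 interact in breast cancer.",
}
graph = build_graph(PAPERS)
```

In this sketch, two synonyms ("p53" and "TP53") resolve to one canonical entity, which is a very small version of the disambiguation problem mentioned for authors and affiliations. Calling `related_entities(graph, "gene:BRCA1")` then lists the entities that share papers with BRCA1.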
3. What kinds of metrics are used to value a paper? The number of citations? What else? How do you evaluate the importance of a paper?
A number of years ago, we looked at the metrics field and concluded that there was a lot of work to do with citations alone, a lot of work that had not been fully productized. We sought to identify a universal metric that would allow us to organize papers relative to each other and organize entities relative to each other. We landed on a variant of the PageRank algorithm, which is well known as the algorithm that Google started on. The variant we identified is called Eigenfactor, which is very similar to PageRank; we just applied it to the citation network. You calculate it on a very large, accepted citation network; today I believe we have the 4th largest in the world. If you calculate it on that, you end up with robust metrics at the article level to use in any information retrieval, search rankings, and other listings of papers. Those metrics also propagate to the level of the entity, so you can calculate an institute-level Eigenfactor, an author-level Eigenfactor, or a gene-level Eigenfactor. So we ended up with a flexible, powerful, and robust metric, which propagates through our system and supports many of these experiences, such as ranked listings. The short answer is that we use Eigenfactor.
That is only where we started. We sought to identify a base metric to build our initial system on, and where we are going is of course a diverse set of metrics, possibly hundreds of metrics, pulled from across the web as well as from internal network metrics, to be able to improve the experience in a personalized way for users and to give them metrics to explicitly rank papers and entities against. We have actually made a lot of progress on that as well.
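To make the ranking idea in this answer concrete, the following Python sketch runs plain PageRank by power iteration on a toy citation network and then rolls article scores up to authors. It is only an approximation of the approach described: the real Eigenfactor calculation differs in its details, and the toy network, author names, damping value, and the choice to give each co-author full credit are all assumptions made for illustration.

```python
from collections import defaultdict


def pagerank(citations, damping=0.85, iterations=100):
    """Plain PageRank by power iteration.

    citations maps each paper ID to the list of paper IDs it cites.
    """
    nodes = set(citations)
    for cited_list in citations.values():
        nodes.update(cited_list)
    nodes = sorted(nodes)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Papers that cite nothing spread their score evenly over all papers.
        dangling = sum(score[node] for node in nodes if not citations.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_score = {node: base for node in nodes}
        for citing, cited_list in citations.items():
            if cited_list:
                share = damping * score[citing] / len(cited_list)
                for cited in cited_list:
                    new_score[cited] += share
        score = new_score
    return score


def aggregate(article_scores, membership):
    """Sum article-level scores per entity (full credit to every listed entity)."""
    totals = defaultdict(float)
    for paper, entities in membership.items():
        for entity in entities:
            totals[entity] += article_scores[paper]
    return dict(totals)


# Hypothetical toy data: D cites A and C, C cites A and B, B cites A.
CITATIONS = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "C"]}
AUTHORS = {"A": ["alice"], "B": ["bob"], "C": ["alice", "bob"], "D": ["carol"]}

scores = pagerank(CITATIONS)
author_scores = aggregate(scores, AUTHORS)
```

Here paper A, which every other paper cites, receives the highest score, and the same `aggregate` function could be pointed at institutes or genes instead of authors by changing the membership mapping.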
4. Some of the algorithms you just mentioned, such as PageRank, are widely applied by other companies in the industry as well. Compared to big article search engines, your competitors such as Google Scholar, what are your competitive advantages?
First, we do not spend a lot of time thinking about our competitors; we spend a lot of time thinking about how to build a different, transformed experience for streaming and exploring papers through history. I don't think there are any products, platforms, or services out there that fill that need, which is why we are passionate about it and why we work so intensively on it. We think it is a unifying platform. It is not necessarily a disruptive platform, because the function is not well filled today. Thinking about search, search is an effective way of accessing literature. We think there is great search out there; Google Scholar, for example, is the most extensive search on citations. But search is a solved problem for papers you know exist, if you can describe the paper well. If you have a fragment of the title, or part of the abstract or full text, or a couple of the authors, it is trivial to search for that paper. You can use PubMed and Google Scholar to do this. There is great enterprise search as well. We don't work on or think about search too much, although we intend to have good search on our platform. We think about streaming and discovery a lot more, and we think the best opportunity really is to improve the scientific readership experience globally.
5. Through some of your other interviews, we learned that your company also has partnerships with some individuals and organizations, such as publishers. Compared to the product you mentioned before, this is a totally different business model. Can you talk a little more about that?
We have been building our partnerships with publishers for quite a while. Essentially, we work with publishers to help increase the discoverability of articles on a full-text basis. We think publishers are really important in terms of the on-time delivery of the weekly or monthly articles that researchers search for. In the current marketplace, that is essentially not happening. Researchers are underserved and publishers are underserved. So we built a program for publishers over the last couple of years. We accept their content and mine it in a very deep way to increase discoverability, applying the family of machine learning algorithms to each one of the articles. We are open to the publishing industry and value the relationships in which we are working together.
- Operation and Strategy Questions
6. Has the company generated revenue so far? How do you monetize your business model?
We are at the Series A stage. We have some fantastic investors: a large group of Canadian investors, with whom we built the company during the Series A stage. For the next stage, we have venture capital firms in Canada, Hong Kong, and the US. We have 30 people in the company, and we are located in a beautiful start-up base in downtown Toronto. Our company is dominated by data science and engineering, and we are just starting to build up our product team in terms of product management, marketing, and design. Our company has gone through a tremendous transformation internally. A lot of that work will start to expand externally over the next 6 months.
7. What is the biggest challenge your company has come across so far? How did you deal with it?
Our challenges are pretty similar to those of any other start-up company. Fundraising is always a challenge. We are in a very important but specialized market. We are not providing a consumer-facing application in the sense of offering it to the broad consumer market. We have a consumer-facing application with respect to researchers and readers of research literature. There are not 1 billion readers of scientific literature; it is just a small number. But the people who are involved in reading the literature are tremendously important in terms of what they contribute to society and the budgets they command in research dollars, in terms of driving medical and other societal progress. Because of the specialized nature of that market, it is sometimes difficult to persuade people to invest in the company. I think that at this point we have achieved a lot and have overcome many of those communication limitations.
8. Our Big Data Digest has over 150 thousand followers who are interested in or currently working in various big data fields. Is China going to be one of your future markets? How can we help you build influence using our connections and resources?
We see a tremendous opportunity to help increase the discoverability of research for Chinese researchers, a market that is growing really fast. One of the companies we have a partnership with focuses on some of these emerging research markets, and some of their data shows that the growth of these markets is just explosive. So China is definitely a big market for us, and India as well. If there are any opportunities to help us grow our user base through your networks, we would be very grateful to work with you.
We have an Institutional Ambassador Program: researchers who are interested can join our company and work from abroad to promote their research through our product. We can provide contact info for readers who are interested in this program.
- Other questions related to market future, startups management and entrepreneurship
9. As a researcher, how do you envision cancer therapy in the future?
I think the program of personalized medicine and cancer genomics is a long-term program. It is going to take many years to elucidate the long tail of changes in each cancer genome. Each tumor is very different. Cells within a tumor present a lot of differences. Heterogeneous change is the most constant thing in the cancer genome. But looking at this another way, despite the large number of mutations in different cancer genes, all the changes more or less resolve into a relatively small number of pathways. If we have enough pathway-targeting drugs and can use them in a combinatorial fashion, we can systematically and effectively suppress cancer across multiple pathways. Of course we also need to consider and take care of cancer evolution, but I think this is a productive approach. Recent cancer immunotherapy results are also staggering. Considering the nature of cancer and cancer genome evolution, it is definitely better to have multiple lines of therapy. If we can combine them, we can probably suppress cancer in the long run.
10. Do you have any plans to expand your business to clinical informatics? Or other related medical informatics?
I think that is interesting. We have a lot of people with backgrounds in bioinformatics and in clinical/medical informatics as well. We are passionate about those approaches. But as a start-up company, we must focus on one main problem and one main market, and get really good at something within that space. We think that the data we are yielding in the process of mining scientific articles to create our knowledge graph can probably be used in precision medicine, and probably in bioinformatics as another data source to analyze. We will probably focus on providing a transformed data set that can be used by any academic. For next year, we are working on the ability for academics to use our data and build on it.
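--- Yu Lijun (于丽君) --- Guo Mantong (郭曼桐) --- Yan Jin (闫瑾)
Received bachelor's and master's degrees from the Department of Mathematical Sciences at Tsinghua University, with master's research on modeling image inpainting problems. Currently a PhD student in applied mathematics at Case Western Reserve University in the US, working on Bayesian methods for modeling inverse problems; the doctoral research topic is localizing brain activity from MEG (magnetoencephalography) time-series signals. Has broad interests in mathematical modeling, machine learning, artificial intelligence, and image processing, and hopes to meet more friends in related fields and like-minded aspiring entrepreneurs through the Yuandian column, to exchange ideas and improve together.
Undergraduate class of 2005 at the Tsinghua School of Journalism and Communication. After graduation, worked at Xinhua News Agency, mainly on television production for the CNC World English channel. Was posted by Xinhua to Washington, DC as a foreign correspondent, and took part in coverage of the US presidential election, the US government shutdown, and the IMF and World Bank annual meetings.
Interview contact for the Yuandian column. Top science-track scorer in Lanzhou in the 2004 college entrance examination (gaokao); bachelor's degree in mathematics from Tsinghua University; PhD in mathematical statistics from the University of Maryland. The doctoral dissertation studied applications of eye-tracking data in healthcare and marketing, with additional work on clinical data analysis for HIV drugs. Currently does quantitative analysis at a large US bank, building credit risk models for real estate loans. Looks forward to meeting, through Yuandian, more friends who want to start businesses in big data related industries. Is also currently assembling a team to provide big data consulting services to customers (analysis, modeling, solutions, model testing, etc.); those with project needs may contact: firstname.lastname@example.org.
Originally published on the WeChat official account 大数据文摘 (BigDataDigest)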