采访|刘小娇 孙强 阚玺
撰稿|孙强 阚玺 魏子敏 郭曼桐
1. Can you briefly introduce Sciencescape? What are the factors/motivations that drive you to setup Sciencescape?
Briefly, we set out to build Sciencescape which is a place on the web that connects you to breaking research as happens around the world, and also gives you some powerful tool to explore the entire history of Biomedicine.
The motivations for building Sciencescape, were a number of experiences that I had as a cancer genomics researcher, through a PhD that I was working on a number of years ago, essentially with the challenges around how difficultitis to browse and explore the history of any fields of research or fields of research you were working with, identifying landmark papers, tracing the connections among the papers through history. And then second to that, how challenging it is, more nearly possible to be to actually stream and follow the research in a personalized manner. Based on that, we looked at the tools were in market at the time, which were essentially just search engines with email alerts and RSS Feed that you could link to from journal tables of content.
We envision a platform that you could walk into and follow anything in your world of research whether it is people, genes, diseases, proteins, products, places, affiliations, etc., anyentity, among millions of potential entities in the world of research that relate to you and ina“twitter-like” experience, research which stream to you, through the web and through the mobile, and then there will be some powerful interfaces for explore history. That is something we envisioned, and 5 years later, we finally made some progress with that.
-Technical & Product-Related Questions
2. Based on your short intro video and online information, we’ve learned that Sciencescape leverages the advantages of Natural Language Processing and Content Identification techniques. How to apply those techniques to your platform to provide real-time customized recommendations for users?
A really good call of what we work on Sciencescape beyond the platform and product user experience, the core work we work on is a knowledge graph, which we have been building for almost half decades now. Essentially the knowledge graph concept is a network of linked entities, to which you organize content around and you organize related to each other, applying that concept to the world of biomedical literature. We have set up on this project to be able to in an automatic fashion using machine learning intelligence and identify all of these entities that are related to any paper. So those entities come from essentially informatics classes, some of these I just mentioned. To do that, you have to apply a family of nature language processing with different algorithms, versusgenes, versus disease, versus an organism, versus a product, etc., and a number of other machine learning approaches, such as classification algorithms, to do with disambiguating authors or affiliations. We apply many different machine learning approaches to the universal challenge of tagging papers correctly with entities and concept.
3. What kinds of metrics are used to value the paper? Through the numbers of citation? Whatelse? How to evaluate the importance of paper?
A number of years ago, we looked at the metrics field and we concluded that there is a lot of work to do with citation work alone–a lot of work that are not fully productized. We thought to identify a universal metrics that would allow us to organize papers relative to each other, and organize entities relative to each other. We landed on a various of pagerank algorithm, which is well known as the algorithm that google started on. We identify the variant which is called eigenfactor, which is very similar to pagerank, we just applied the citation network. And if you calculate that on a very large an accepted citation network–today I believe we have the 4th largest in the world. If you calculated on that, you will end up with robust metrics at the article level to use any information which is retrievable, search rankings and other listing of the papers. But those metrics also propagates at the level of entity, so you can calculate the institute level eigenfactor, or the author level eigenfactor, or the gene level eigenfactor. So we ended up with a flexible, powerful and robust metric, which propagates through our system and allows much of these experiences to keep figured in list things. A short answer is to use eigenfactor.
That is only where we started, so we thought we identify base metric to build our initial system on and where we are going with is of course a diverse set of metrics, possibly hundreds of metrics, pulled from through the web all the metrics, as well as internal network metrics to be able to improve the experience in a personalized way for users and allow to explicitly rank the papers and entities against. We actually get a lot of progress on that as well.
4. Some of the algorithm, such as page ranks, you just mentioned, are widely applied by other companies in industry as well. Compared to those big article search engines, your competitors, such as google scholar, what are your competitive advantages?
First, we do not spend a lot of time thinking about our competitors, we spent a lot of time thinking about how to build a different transformed experience for streaming and exploring papers through history. I don’t think there are any products, any platforms and any services there that fill that need, which is why is worth passion on it and why we work intensively on this. We think it is a unifying platform–it is not necessarily a distractive platform, because the function is not well filled. So thinking about search, search is an effective way that literature is access. We think there is a great search there, such as google scholars, is the most extensive search on citation. But search is a solved problem for papers you know exist, if you can describe a paper well. If you have fragment of the title, or part of whole text abstract, or a couple of the authors, it is trivial to use searches on that paper. You can use pubmed and google scholar to do this. There is great enterprise search as well. We don’t work on and think about search too much, we intend to have good search on our platform. But we think about streaming and discovering a lot more and we think the best opportunity really is to improve scientific leadership experience globally.
5. Through some other interview to you, we got to know that your company also had some partnership relation with some individuals or organizations, such as publishers. Compared to the product you mentioned before, this is totally a different business model. Can you talk a little bit more on that?
We have been building our partnership with publishers for quite a while. Essentially, we worked with publishers to help increase the discover ability of article on full text bases. We think publisher is really important in terms of on time delivery of weekly or monthly articles that researchers searched for. In the current market place, that is essentially not thecase. There searchers are unserved and publishers are unserved. So we built a program over the last a couple of years for publishers. We accept their content and mine it in a very deep way to increase the discover ability by applying the family of machine learning intelligence algorithms to teach one of the articles. We are open to publisher industrial and value the relationship that we are working together.
-Operation and Strategy Questions
6. Has the company generated revenue so far? How do you monetize your business model?
We are in Series Astage. We had some fantastic investors; we have a large Canadian group investors, who we built the company with during Series Astage. During the next stage, we have venture capital firms in Canada, HongKong and US. We have 30 people in the company and we are located in a beautiful start-up based downtown Toronto. Our company is dominant by data science and engineering and we just start to buildup our product team, interms of product management, marketing, design. Our company has gone throught remendous transformation internally. A lot of work load will start to expand externally over the next 6 months.
7. What is the biggest challenge your company came across so far? How did you deal with the challenge?
Our challenge is pretty similar to any other start-up companies. Fund raising is always a challenge. We are in very important but specialized market. We are not providing a consumer-facing application in the sense that we offered to the broad consumer market. We have consumer-facing application with respect of researchers and leaders of research literature. So there is not 1 billion readers of scientific literature–it is just a small number. But the people who involved in reading the literature are tremendous important from what they contributed to society and the budget they command in terms of research dollars, in terms of driving medical and other society progress. Because of that specialized nature of market, sometimes it is difficult to persuade people to invest in the company. I think at this time, we have achieved a lot and we overcome a lot of those communication limitations.
8. Our big data digest has over 150 thousands followers who are interested or currently working in various big data fields. Is China going to be one of your future markets? How can we help you to make influence using our effective connections and resources?
We see tremendous opportunity to help increased is coverability of we search for Chinese researchers, a market growin greally fast. One of the companies we have partnership with, focus on some of these emergent research markets and some of these data shows the growth of these markets is just explosive. So China is definitely a big market for us and India as well. If any opportunities help us to grow our usership through networks, we will be very graceful to work with you.
We have Institutional Ambassador Program–researchers who interested can join our company and work from abroad, to promote their research through ourproduct. We can provide contact info for the readers who are interested in this program.
-Other questions related to market future, start ups management and entrepreneurship
9. As are searcher, how do you envision the cancer therapy in future?
I think the program of personalized medicine and Cancer Genomics is a long-term program. So it is going to take many years to elucidate all the long tails of changes in each cancer genome. Each tumor is very different. Cells in the tumor presents lot of differences. heterogenous changes are the most constant thing in cancer genome. But looking into this in another way, despite of the large number of mutations in different cancer genes, all the changes are kind of resolved in to a relatively small number of pathways. If we have enough pathway-targeting drugs and we can use them in a combinatorial fashion we can systematically and effectively suppress cancer across multiple pathways. Of course we also need to consider and take care of cancer evolution, but I think this is a productive approach. Also recent cancer immunotherapy results are staggering. Considering the nature of cancer and cancer genome evolution, it is definitely better to have multiple-line of therapies. If we can combine them we can probably suppress cancer in along run.
10. Do you have any plans to expand your business to clinical informatics? Or other related medical informatics?
I think that is interesting. We havea lot of people who have background of bioinformatics and clinical/medical informatics as well. We are passionate to those approaches. But as a start-up company, we must focus on a main problem, a main market and get really good with something within that space. So we think that the data we are yielding in the process of mining scientific articles to create our knowledge fast can probably be used in precision medicine and probably be used in bioinformaticsas another data source to analyze. We will probably focus on providing a transformed data set that can be used by any academics. We are working on for next year the ability for academic to use our data and build on it.
于丽君：本科硕士毕业于清华大学数学系，硕士研究课题为图像修补问题建模，目前为美国Case Western Reserve University应用数学在读博士，研究方向为贝叶斯方法反问题建模，博士研究课题为利用MEG(脑磁成像技术)时序信号对大脑活动进行定位，对数学建模、机器学习、人工智能以及图像处理等方面有广泛兴趣，希望可以通过原点栏目组，结识更多相关领域的朋友以及有志于创业的同仁，互相交流进步。
郭曼桐：清华新闻与传播学院2005级本科，毕业后在新华社工作，主要参与CNC World英语台的电视节目制作。曾被新华社派驻至美国华盛顿任驻外记者。曾参与美国大选、美国政府关门风波、IMF和World Bank年会等报道。
闫小瑾：原点栏目采访联络人, 04年兰州市高考理科状元，清华数学本科，美国马里兰大学数理统计博士。博士论文研究眼动数据(eye tracking)在医疗、市场营销方面的应用;同时关注HIV药品的临床数据分析。目前在美国某大型银行做量化分析，构建房地产贷款的信用风险预测模型 (credit risk model)。期待通过原点认识更多有志于大数据相关产业创业的朋友。同时现正组建自己的团队，为顾客提供大数据咨询服务(分析，建模，解决方案，模型测试等)，有项目需求者请联系: firstname.lastname@example.org.