Solving such tasks always means dealing with large amounts of data in different formats, which often involves writing converters and the like. For example, in the Yandex contest I first tried to apply the SVMRank algorithm to the given data using SVMLight. That forced me to write a small program for sorting the input data by query id, because SVMLight requires it. Then, of course, I added the ability to restore the original data order, because the results are checked by the contest's automatic checking system, which knows nothing about me reordering the data.

Then I decided to apply PCA to the input data before applying SVMRank (warning: due to the unusual distributions of the feature values, the result was terrible!). MATLAB contains a freely available PCA implementation, so I added the ability to export the data to a format supported by MATLAB and import it back into my program. Then I repeated the following sequence of actions several times (once for each number of selected PCA components): convert the data to MATLAB format, load the attributes, apply PCA (fortunately, these first 3 steps need to be done only once), save the first K attributes, convert back to SVMLight format, sort by query id, and learn an SVMRank model. Then I did the same for the test data, except that model learning was replaced with rank prediction and restoring the original order. Of course, I could have made a script out of that sequence and run it several times, but that has its own drawbacks. For example, I know nothing about integrating external applications with MATLAB, and learning that stuff would only increase the time required for the whole process.

So, all of that is boring and dreary, and it distracts you from the actual task. What can we do about it? I've just thought: why is there no special operating system for researchers, scientists and engineers that would make their lives much easier? Here is a short list of what I'd like to see in it:
Primary (related to the problems described above):
- A math engine (some kewl mixture of MATLAB and Maple) should be integrated with the OS environment. I want the ability to run any math command in any open shell and get the result immediately (well, that depends on the command). And all the calculated math data should be accessible through the math engine from any other part of the OS.
- There should be a comfortable math engine API available for the most popular programming languages.
- There should be some standard for all the data types used in the math engine.
- All the external math algorithms should be integrated into the OS math engine. All those algorithms should use the same math data types provided by the math engine (see 3).
Secondary (all the ideas that came to my head while writing "Primary" section):
- I'd like to see some semantic relationship database integrated into the OS, where you can hold dependencies between papers, researchers, lectures, presentations, data sets, experiment results etc. (take a look at the MSR Research-Output Repository Platform). For example, it would be rather kewl to visualize the references between articles in some research area as a graph. That would let you quickly pick out the most important works, and it opens many other possibilities.
- There should be some useful network interaction mechanisms. For example, I could stream experiment results in real time to a colleague who would immediately visualize them. Or I could implement some math algorithm using the OS math engine and ask my colleagues to share their computing capacity with me. Or I could submit it to some computing server and then stream the required data to it. And all of that should happen completely transparently. And there should be, of course, some mechanism for synchronizing the relationship database between network nodes.
- And, finally, it would be a pleasure to have some comfortable LaTeX editor and compiler with all the necessary stuff.
At the very least, unifying the data formats and the math engine would help with exchanging data, prototypes, algorithms etc. between researchers. Nowadays that very often turns into a game of "Let's try to build it".
Well, it would be nice to hear other ideas and opinions. And then, maybe, someone will read this post, get inspired, and 5 to 7 years later a new operating system will totally change the whole scientific community =)
By the way, my team in the Yandex contest has the same name as this blog.
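P.S. To give an idea of the kind of glue code I'm complaining about, here is a rough sketch of the "sort by query id, then restore the original order" step. It is not my actual contest program, just a minimal illustration in Python. It assumes SVMLight-style lines of the form `<target> qid:<id> <feature>:<value> ...` with no comment lines, and the function names (`qid_of`, `sort_by_qid`, `restore_order`) are made up for this example.

```python
import re

# Matches the "qid:<number>" field of an SVMLight-style ranking line.
QID = re.compile(r"\bqid:(\d+)")


def qid_of(line):
    """Extract the integer query id from one line."""
    match = QID.search(line)
    if match is None:
        raise ValueError(f"no qid field in line: {line!r}")
    return int(match.group(1))


def sort_by_qid(lines):
    """Return (sorted_lines, order); order[i] is the original index of sorted line i.

    sorted() is stable, so lines sharing a qid keep their relative order.
    """
    order = sorted(range(len(lines)), key=lambda i: qid_of(lines[i]))
    return [lines[i] for i in order], order


def restore_order(values, order):
    """Put per-line values (e.g. predicted scores) back into the original line order."""
    restored = [None] * len(values)
    for sorted_pos, original_pos in enumerate(order):
        restored[original_pos] = values[sorted_pos]
    return restored


if __name__ == "__main__":
    data = [
        "2 qid:3 1:0.5 2:0.1",
        "1 qid:1 1:0.2 2:0.9",
        "0 qid:3 1:0.0 2:0.4",
        "1 qid:2 1:0.7 2:0.3",
    ]
    sorted_lines, order = sort_by_qid(data)
    # Stand-in for the ranker's output: one score per sorted line.
    scores = [float(i) for i in range(len(sorted_lines))]
    print(restore_order(scores, order))
```

The point of keeping `order` around is that the predictions come back in sorted order, while the checking system expects them in the order of the original file.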
I don't think that there's a need for a completely new operating system to get the things you describe done. Developing a new OS would be an unneeded waste of time and resources. Just think about creating a competing compiler for it: is this really worth the effort?
You mentioned a lot of quite general things in your post. In every given project there is always a need to make some unique decisions, so you can't create a universal mechanism for exchanging data that is better than those that already exist (databases, spreadsheets, simple sequences of numbers properly interpreted).
And on data type unification and a computational API: you say that you don't want to learn the tools already present in MATLAB for interacting with external applications. But introducing a new OS would bring many more difficulties than that, not to mention its bugs and limitations.
At least I can say that machine learning solves 3 major problems: classification, regression and ranking, and each of them can be unified very well in terms of features, classes, function values and ranks.
And I suggest building that OS on top of the MSR Singularity project. It will help with the competing compiler problem and provide a comfortable execution environment. And it's my favorite .NET and C#, yeah.