The Operating Systems class I am TAing has reached the file system part, in which a virtual file system layer sitting above the different file systems is introduced.
On the drive to Qualcomm New Jersey, a question came to my mind: is there a virtual action manipulation layer between our mind and our motor system? For example, our mind may pass down a command: grab the cup on the desk. After interpretation by the virtual action manipulation layer, the same command is translated into several different low-level commands for the different motor systems:
To the eyes, the command becomes Eye_Grabcup, which actually focuses on the cup;
To the legs, the command becomes Leg_Grabcup, which may drive the legs to walk towards the cup;
And to the hands, the command becomes Hand_Grabcup, which is: go and grab the cup!
Now, when we talk about the Vision Executive and the Language Executive, is it reasonable to put a virtual action manipulation layer over them? For example, "focus" is an actual action of the VE, and "search related knowledge" is an action of the LE; both of them share the same virtual layer function: "Attend!"
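To make the analogy concrete, here is a minimal sketch of such a layer, written the way a VFS dispatches one call to several concrete file systems. The class names (MotorSystem, VirtualActionLayer) and the dispatch method are made up for illustration; only the Eye_/Leg_/Hand_ command names come from the idea above.

```python
from abc import ABC, abstractmethod


class MotorSystem(ABC):
    """Hypothetical interface: one concrete 'driver' under the virtual layer."""

    @abstractmethod
    def handle(self, command: str, target: str) -> str:
        """Translate a high-level command into this system's low-level command."""


class Eyes(MotorSystem):
    def handle(self, command: str, target: str) -> str:
        return f"Eye_{command}: focus on the {target}"


class Legs(MotorSystem):
    def handle(self, command: str, target: str) -> str:
        return f"Leg_{command}: walk towards the {target}"


class Hands(MotorSystem):
    def handle(self, command: str, target: str) -> str:
        return f"Hand_{command}: go and grab the {target}"


class VirtualActionLayer:
    """Hypothetical layer: the 'mind' only ever talks to this object."""

    def __init__(self, systems):
        self.systems = list(systems)

    def dispatch(self, command: str, target: str) -> list:
        # The same high-level command fans out to every registered system.
        return [system.handle(command, target) for system in self.systems]


layer = VirtualActionLayer([Eyes(), Legs(), Hands()])
low_level = layer.dispatch("Grabcup", "cup")
for line in low_level:
    print(line)
```

The same shape would cover the VE/LE case: each executive implements its own version of "Attend!", and the caller never needs to know which one it is talking to.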
Finally we completed the Qualcomm presentation~~ It always feels good to have a chance to share our ideas and preliminary results. The researchers from Qualcomm and the other finalist students attending also gave us a lot of valuable suggestions. We appreciate every comment, and there is still a loooot of work ahead~~
THIS IS A BLOG OF YEZHOU YANG, A PH.D STUDENT AT UNIVERSITY OF MARYLAND, COLLEGE PARK. MOST OF THE POSTS HERE ARE MY STUDY AND RESEARCH NOTES FOR QUICK ONLINE ACCESS. OCCASIONALLY, MY STUPID IDEAS WILL ALSO BE SHARED HERE.
Wednesday, April 20, 2011
Thursday, April 14, 2011
Something about TA
I am TAing Operating Systems this semester.
Every time I see students struggling with and being tortured by the projects (such as implementing a paging system or a file system on a toy OS, GeekOS), I really feel the difference between the US education system and the Chinese one. Undergraduate students here are soooo busy working on all kinds of projects, which gives them first-hand experience in implementing systems, while in China the OS class is more about the "concept".
Even if you know all the "concepts", you still won't know how an OS is actually implemented. The only "projects" are reading through an old version of the Linux kernel, and you never ever actually implement a function... If we think one step deeper, what we learned is how to "copy", not how to "create". That is a huge difference.
BTW, playing around with GeekOS is interesting (as long as you don't actually need to implement those headache-inducing projects).
Friday, April 1, 2011
Michael A. Arbib's talk
Michael A. Arbib
Template construction grammar and the generation of descriptions of visual scenes.
At the beginning of the lecture, Prof. Arbib showed us a static image of three women racing; one had fallen down, and each of them has only one natural leg (the other one is synthetic). He showed us the image for just 5 seconds and asked people to describe it. The interesting thing is that people tend to attend to the parts of the image in a hierarchy, following an attention mechanism.
In other words, people tend to focus first on the women racing, and then "ah, someone fell down", and then "ah! they all have one synthetic leg!".
Then he introduced construction grammar, which basically works like this: when an object or action is observed, a node is automatically added to a graph structure. Once every attention-grabbing thing and event has been perceived, the structure is parsed up to higher levels until a sentence or a description is generated.
For example, a woman punches a man in the face. If you attend to the woman's fist and then the man's face, the description tends to be: a fist punches the man. Conversely, if you attend to the man's face before the fist, the description tends to be: a man is hit by a fist.
And all of the construction grammar is built on a strong assumption: the vision system gives perfect output.
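The fist/face example can be played with in a few lines. This is only a toy of my own, not Prof. Arbib's template construction grammar: the Event class and the describe function are hypothetical, and the only rule is "whatever was attended first becomes the subject". It also assumes perfect vision output, exactly as stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical container for one perceived action."""
    agent: str
    patient: str
    active_verb: str
    passive_verb: str


def describe(event: Event, attention_order: list) -> str:
    # Add a node for each participant in the order it was attended to.
    nodes = []
    for item in attention_order:
        if item in (event.agent, event.patient) and item not in nodes:
            nodes.append(item)
    if not nodes:
        raise ValueError("no participant of the event was attended to")
    # Toy rule: the first attended node becomes the subject of the sentence.
    if nodes[0] == event.agent:
        return f"a {event.agent} {event.active_verb} the {event.patient}"
    return f"a {event.patient} {event.passive_verb} a {event.agent}"


punch = Event(agent="fist", patient="man",
              active_verb="punches", passive_verb="is hit by")
fist_first = describe(punch, ["fist", "man"])
face_first = describe(punch, ["man", "fist"])
print(fist_first)
print(face_first)
```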
***********************************************************************************
Now, coming back down to earth: at least for now, vision systems are far from giving perfect output. All we can do is assign some kind of probability that some thing or some action exists in the image or video. Then what will happen to this grammar? It becomes probabiliticalized (I made the word up....)! Just like what we did for the EMNLP paper, we take noisy vision output as input and use a tweaked HMM system to generate the most likely description of the image. I believe that if we can combine the uncertainty of vision output with the language construction Prof. Arbib introduced here, a robust scene description generator is not far away.
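As a rough illustration of what "noisy vision output in, most likely description out" can look like, here is a plain Viterbi decode over three sentence slots. This is a textbook HMM sketch, not the tweaked model from the EMNLP paper: the candidate words, the vision confidences and the language-model probabilities are all numbers I made up, and the function name is hypothetical.

```python
import math


def most_likely_description(slots, vision_conf, lm_prob, floor=1e-6):
    """Viterbi over word slots.

    slots: list of candidate word lists, one per sentence position.
    vision_conf: word -> detector confidence (made-up numbers here).
    lm_prob: (previous word, word) -> language-model probability.
    """
    # best[word] = (log score, path) of the best path ending in that word.
    best = {w: (math.log(vision_conf[w]), [w]) for w in slots[0]}
    for candidates in slots[1:]:
        new_best = {}
        for w in candidates:
            options = []
            for prev, (score, path) in best.items():
                step = math.log(lm_prob.get((prev, w), floor))
                options.append((score + step + math.log(vision_conf[w]),
                                path + [w]))
            new_best[w] = max(options, key=lambda option: option[0])
        best = new_best
    return max(best.values(), key=lambda option: option[0])[1]


slots = [["woman", "dog"], ["punches", "rides"], ["man", "horse"]]
vision_conf = {"woman": 0.6, "dog": 0.4, "punches": 0.45,
               "rides": 0.55, "man": 0.7, "horse": 0.3}
lm_prob = {("woman", "punches"): 0.5, ("woman", "rides"): 0.5,
           ("dog", "punches"): 0.05, ("dog", "rides"): 0.1,
           ("punches", "man"): 0.8, ("punches", "horse"): 0.2,
           ("rides", "man"): 0.05, ("rides", "horse"): 0.95}

description = most_likely_description(slots, vision_conf, lm_prob)
print(" ".join(description))
```

With these made-up numbers the detector alone slightly prefers "rides" over "punches", but the language side pulls the decode to "woman punches man", which is the kind of correction the probabilistic version is meant to provide.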
PS: Prof. Arbib is a really humorous senior professor:
He referred to the traditional "S->NP VP..." grammar as "Cheerleader" grammar. Why? Give me an S! Give me an N! Give me a V!
He also said he is planning to retire in five years, although the number five is a constant~
^^