Thursday, April 16, 2009

Skipping Spare Information in Multimodal Inputs during Multimodal Input Fusion

by Sun, Y., Shi, Y., Chen, F. and Chung, V.

Summary:

The article is based on the assumption that multimodal user interfaces incorporating natural communication modalities would enable wide adoption of new technologies. A technique called Multimodal Input Fusion (MMIF) is described as the combination of inputs from different modalities and their semantic interpretation. These inputs are symbol strings consisting of individually recognized input elements from different modalities. As a whole, these symbol strings add up to a multimodal utterance, as defined by the authors.

The motivation to address the spare information problem surfaced during the pilot study of Sun et al.'s previous attempt at a flexible MMIF technique. It was observed that some inputs from different modalities have no semantic relation to other inputs present in the same multimodal utterance. This suggested the presence of spare information in multimodal inputs, which made semantic interpretation harder for the system. Therefore, they propose an MMIF approach that leaves out the spare information and semantically interprets the remaining inputs in combination. They go on to explain the structure of the proposed multimodal input parser: the multimodal utterance is decomposed into sub-utterances through a grammar called Multimodal Combinatory Categorial Grammar. After this decomposition, a matrix is created for recombining the sub-utterances, and the results with the maximum number of symbols have priority going into the Dialogue Management Module, a decision module that outputs a meaningful utterance when one is found. A rough sketch of this recombination idea follows.
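Here is a rough sketch of the idea, not the authors' implementation; the `parses` function is a hypothetical stand-in for the grammar-based parser, and the utterance contents are invented:

```python
# A rough sketch of the skipping idea, not the authors' implementation.
from itertools import combinations

def parses(combo):
    # Hypothetical stand-in for the Multimodal Combinatory Categorial Grammar
    # parser: a combination is "meaningful" only if it pairs exactly one
    # speech command with exactly one pointing gesture and nothing else.
    return sorted(kind for kind, _ in combo) == ["point", "speech"]

def interpret(utterance):
    """Try sub-utterance recombinations, largest first, skipping spare symbols."""
    for size in range(len(utterance), 0, -1):      # maximum-symbol results first
        for combo in combinations(utterance, size):
            if parses(combo):
                return combo                       # handed to the Dialogue Management Module
    return None                                    # no meaningful utterance found

# Toy multimodal utterance: the stray scribble is spare information.
utterance = [("speech", "move this there"), ("point", "object A"), ("scribble", "noise")]
print(interpret(utterance))   # the scribble is skipped
```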

Discussion:

The authors' approach of skipping spare information by parsing multimodal utterances and recombining them into meaningful ones seems to be an effective way to derive semantic information, and something like it is certainly needed for natural multimodal human-computer interaction. However, when considering the causes of spare information, the authors do not seem to take users' preconceptions about the system, or the cognitive load imposed on them, into account. As long as these remain uninvestigated, it seems that spare information would grow as more modalities are added to the system. This would seriously affect the recombination and decision process, which is derived from a matrix of all sub-utterances recognized by each modality's individual engine.

Citation:

Sun, Y., Shi, Y., Chen, F. and Chung, V. Skipping Spare Information in Multimodal Inputs during Multimodal Input Fusion. In IUI '09: Proceedings of the International Conference on Intelligent User Interfaces, Sanibel Island, Florida, USA, 2009.

A Multimodal Interface for Road Design

by Blessing, A., Sezgin, T. M., Arandjelovic, R. and Robinson, P.

Summary:

The authors begin the article by stating that designing with a computer can be made more interactive and natural with speech and sketch-based interfaces. They set out to show this in a representative interface for road design, intended either for actual road design or for designing roads for simulators. The common part of these two tasks is designing the physical aspects of the road, such as the road itself, the signs, the placement of the signs, etc., and it is proposed that these be specified with the sketch-based system. The design of the traffic behavior within a simulation, on the other hand, could be achieved through speech.

After conducting interviews with experts on road design, the authors came up with the Multimodal Intelligent Road Design Assistant (MIRA) as their intended speech and sketch-based system. Features of the system were mapped to modalities so as to meet the desired affordances elicited from the experts. For example, it is suggested that the behavior of an object or a group of objects can be defined with speech, together with a traditional WIMP modality, while designing the road itself can be done with a pen-based modality. Furthermore, the modalities are used in a complementary fashion: while some features could be mapped to either speech or sketching, the authors chose one of the two in order not to overload a given modality.

After mapping features to modalities, the authors continue by describing the speech and sketch recognition engines. Finally, they present the pilot testing and evaluation and finish the article by outlining related and future work.

Discussion:

Overall, the article shows that a system with natural modalities such as speech and sketching is capable of making the design process more interactive. One interesting point from the pilot test is that the limited capabilities of speech interfaces so far have a negative effect on users when they try to communicate with the system. A video showing the usage of the speech system was presented to overcome this negative effect, and it succeeded.

The article lacks a principled method for assigning modalities to the different capabilities of the system, and this shows in the pilot study. The mapping was done in an intuitive manner, which led to recognition errors during editing operations. Short editing words such as "undo", "redo", etc. were misrecognized due to the speech engine's high sensitivity to noise. The authors had to change the intended modality for these operations to pen-based input in order to prevent errors that might frustrate the end user.

Citation:

Blessing, A., Sezgin, T.M., Arandjelovic, R. and Robinson, P. A Multimodal Interface for Road Design. In IUI '09: Proceedings of the Intelligent User Interfaces Workshop on Sketch Recognition, Sanibel Island, Florida, USA, 2009.

Wednesday, February 25, 2009

HMM-Based Efficient Sketch Recognition

by Sezgin, T. M. and Davis, R.


Summary:


At the beginning of the article, Sezgin states the intention of the paper: proposing a novel approach to sketch recognition that takes into account its incremental and interactive nature. The incremental nature comes from the fact that strokes are drawn one by one, and the interactive nature from the two-way communication between the user and the computer.

He continues by providing the terminology for the paper. He gives the informal definition of a sketch as a messy hand drawing, and the formal definition as a sequence of strokes in which the x, y coordinates and the time t of each point are captured. He asserts that stroke ordering is important in recognition, since a user's preferred order of strokes defines his or her sketching style. He then describes the three steps of the recognition process: segmentation, which groups strokes belonging to the same object; classification, which determines which object class each stroke group accounts for; and labeling, which, as the name suggests, assigns labels to the recognized objects' components.
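As a concrete illustration, the formal definition above could be represented as follows; this is a minimal sketch of an assumed representation, not code from the paper:

```python
# A minimal sketch (assumed representation, not the paper's code): a sketch
# is a sequence of strokes, and each stroke is a sequence of (x, y, t) points.
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, t): coordinates plus timestamp

@dataclass
class Stroke:
    points: List[Point]

@dataclass
class Sketch:
    # Strokes are kept in drawing order, since stroke ordering encodes the
    # user's sketching style and drives segmentation and classification.
    strokes: List[Stroke]
```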

The author identifies the problem with current sketching systems as treating all users with the same recognition procedures and therefore ignoring their individual sketching styles. He suggests that employing an approach that captures these sketching styles provides efficient sketch recognition with polynomial time requirements.

A user study indicated that people tend to be stylized while sketching and are persistent in their styles across different sketches. From this intuition, Sezgin proposes to use HMMs to model different sketching styles. Next, he gives an overview of HMMs and goes on to design HMMs with fixed-length and variable-length training data, to accommodate the different observation lengths that arise when the same object is sketched in different styles. With fixed input length, after training the HMMs to capture different sketching styles with the Baum-Welch method, calculating the probabilities of every model presents the challenge of pre-segmenting the sketch. To overcome this challenge, Sezgin proposes a dynamic programming approach cast as a shortest path problem. After segmentation, classification is done by computing the probability of every model with the Forward algorithm and choosing the model with the highest likelihood, and thus the object class.
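The classification step can be illustrated with a toy Forward-algorithm scorer; the two models below are hypothetical hand-set HMMs (in the paper they would be trained with Baum-Welch on real stroke data), so this is a sketch of the idea rather than Sezgin's implementation:

```python
# A toy illustration of the classification step, not Sezgin's code: score a
# pre-segmented observation sequence under each object class's HMM with the
# Forward algorithm and pick the most likely class.
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(obs | HMM), where pi is the initial distribution, A the transition
    matrix, B the emission matrix, and obs a list of symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Two hypothetical 2-state HMMs over 3 observation symbols (think quantized
# stroke primitives); in the paper these are trained with Baum-Welch.
models = {
    "stick figure": (np.array([0.9, 0.1]),
                     np.array([[0.7, 0.3], [0.2, 0.8]]),
                     np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])),
    "rectangle":    (np.array([0.5, 0.5]),
                     np.array([[0.5, 0.5], [0.5, 0.5]]),
                     np.array([[0.1, 0.8, 0.1], [0.1, 0.8, 0.1]])),
}

obs = [0, 0, 2, 2]                         # an encoded primitive sequence
best = max(models, key=lambda name: forward_likelihood(obs, *models[name]))
print(best)                                # -> "stick figure"
```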

When designing with variable-length training data, Sezgin gathers all training examples of an object, thereby embodying all sketching styles, and trains one HMM per object class. After training the HMMs, the probabilities of the ending states are also estimated to aid the recognition process. Segmentation is achieved in the same way as with fixed-length training data.

Evaluation of these two frameworks showed that, as discussed in the paper, recognition accuracy increases slightly with variable-length training data. The author attributes this to those HMMs being trained with more data.


Discussion:


In his '05 paper, Sezgin brings many intuitions that appear to have had considerable influence on sketch recognition research. The idea of exploiting the incremental and interactive nature of sketches, using it to design HMMs, and achieving polynomial time requirements for recognition is very impressive. The contributions of this paper are substantial; still, its most important parts are the discussion and future work. The author clearly knows the drawbacks of the proposed systems, e.g. handling single-stroke objects, and shows where his research is headed.

There are some phenomena, though, that he does not cover in the paper. It is not clear whether the proposed systems consider the recognition of filled-in or overtraced objects, both of which are very common in sketching. Another point of concern is whether, whenever a new domain must be added, the system would need to be redesigned from scratch for that specific domain, or whether an ordinary user could do this simply by providing training examples. Lastly, the system could take correctly recognized objects and add them to the training set to increase recognition accuracy.


Citation:


Sezgin, T. M. and Davis, R. HMM-Based Efficient Sketch Recognition. In Proceedings of the International Conference on Intelligent User Interfaces (IUI’05), January 10–13, 2005, San Diego, California, USA. ACM Press.

Sketched Symbol Recognition using Zernike Moments

by Hse, H. and Newton, A. R.


Summary:


Hse begins the article by drawing attention to the increasing interest in sketch-based interfaces and states that graphical symbol recognition is the aim of the paper. Symbol recognition in sketches is approached in three different ways in the community: statistical, structural and rule-based. In this work, Hse chooses a statistical approach in which Zernike moments are used for symbol representation. Zernike moments are rotation and reflection invariant, which makes them effective for symbol recognition. The author proposes a method in which Zernike moment features are evaluated with three classification methods: Support Vector Machines (SVM), Minimum Mean Distance (MMD), and Nearest Neighbor (NN).

Before going further with the description of Zernike moments and the evaluation of the features, pre-processing is necessary because Zernike moments are not invariant to scale and translation. Another pre-processing step is the approximation and interpolation of the data points to ease the calculation of the moments. Hse then explains Zernike moments briefly and elaborates on the choice of the SVM classifier, since the other classifiers have fairly straightforward algorithms.
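A minimal sketch of such normalization, under assumed conventions (centroid translation, scaling to the unit disk on which Zernike polynomials are defined, and arc-length resampling), might look like this:

```python
# A minimal normalization sketch under assumed conventions, not Hse's code.
import numpy as np

def normalize(points):
    """points: (N, 2) array of a symbol's sampled pen coordinates."""
    pts = points - points.mean(axis=0)            # translation: centroid to origin
    radius = np.linalg.norm(pts, axis=1).max()
    return pts / radius if radius > 0 else pts    # scale: fit the unit disk

def resample(points, n=64):
    """Interpolate evenly along arc length so the moments see uniform sampling."""
    d = np.r_[0, np.cumsum(np.linalg.norm(np.diff(points, axis=0), axis=1))]
    t = np.linspace(0, d[-1], n)
    return np.column_stack([np.interp(t, d, points[:, i]) for i in range(2)])
```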

For the experimental results, two setups are used. In the first, the user trains the recognizer with his or her own data (the user-dependent test), while in the second a pre-trained recognizer adapts itself as user examples are added (the user-independent test). In the user-dependent test, classifiers are trained and tested on a single user's dataset using K=10-fold cross validation, and the classifiers are evaluated on every user's dataset in this way. This experiment yields the highest accuracy with the SVM classifier when the data is interpolated and Zernike moments of order 8 are used. For higher orders, the SVM classifier's accuracy increases slightly, while the other classifiers' accuracy decreases.
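For flavor, the user-dependent protocol could be approximated with scikit-learn's 10-fold cross validation; the feature matrix here is a random placeholder standing in for order-8 Zernike magnitudes (25 features), so the printed score demonstrates only the protocol, not the paper's results:

```python
# Protocol sketch only: random placeholder features stand in for order-8
# Zernike magnitudes (25 features), so the score itself is meaningless here.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))          # one user's symbols, 25 moment features
y = np.repeat(np.arange(10), 20)        # 10 symbol classes, 20 examples each

scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)   # K=10 folds
print(f"mean accuracy: {scores.mean():.3f}")
```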

In the user-independent test, classifiers are trained and tested with N-fold cross validation, where N is the number of users: each test uses one user's dataset while all other users' datasets are used for training. The SVM classifier performs best in this experiment too, with interpolated data points and Zernike moments of order 8.


Discussion:


Hse's paper provides yet another way of recognizing low-level graphical shapes, this time using Zernike moments as features and evaluating them with three classifiers: SVM, MMD and NN. The achieved accuracy rates are impressive considering the method's ability to recognize shapes independently of stroke order, direction, scale, translation, reflection and rotation.

Despite its valuable results, the paper does not seem to consider the next steps toward recognizing sketches as a whole. The beautification process mentioned in the paper raises doubts about the handling of filled-in and overtraced shapes, which are not mentioned at all. This may be because the paper targets the recognition of a limited set of symbols that need not be emphasized when drawn (UML diagrams, PowerPoint slides, etc.).


Citation:


Hse, H. and Newton, A. R. Sketched Symbol Recognition using Zernike Moments. In ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition, Volume 1, pages 367-370, Washington, DC, USA, 2004.

Sketch Based Interfaces: Early Processing for Sketch Understanding

by Sezgin, T. M., Stahovich, T. and Davis, R.


Summary:


Sezgin et al. describe the motivation for the paper as enabling users to interact with computers in a more natural way, namely freehand sketching, rather than through a gesture-based system. To this end, they propose an implemented system that converts pixels into simple geometric shapes such as lines, rectangles, curves and their combinations. The authors make use of the temporal information of the strokes that is inherent in on-line sketching. It is assumed that the user draws objects with single strokes.

The proposed system has three stages: approximation, beautification and simple recognition. During approximation, the conversion is handled by treating the vertices of the stroke's line segments and its curved segments separately, working on the intuition that people tend to slow down at corners while drawing. The authors therefore take speed data and stroke direction as input for vertex detection. To filter out noise, they use average-based filtering, looking for minima of the speed data below a threshold and maxima of the curvature data above one, where each threshold is empirically set to the mean of the respective data set. Since neither information source alone gives consistently good results, the authors develop a hybrid algorithm that first takes vertex candidates from both sources and then decides on the best fits while taking into account the system's certainty. The initial fit is the intersection of the candidate sets from both sources. Then, in each cycle, two new fits are created, one with the best remaining curvature candidate and one with the best remaining speed candidate, and the new fit is chosen according to its least squares error. An upper bound for the error term is set beforehand.
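A simplified sketch of the candidate-generation step, assuming uniform sampling and the mean-based thresholds described above (the hybrid fitting loop is omitted):

```python
# A simplified sketch of vertex-candidate generation, not the paper's code:
# speed minima below the mean speed and curvature maxima above the mean
# curvature become corner candidates for the hybrid fit.
import numpy as np

def vertex_candidates(x, y, t):
    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)   # assumes dt > 0
    speed = np.hypot(dx, dy) / dt
    direction = np.unwrap(np.arctan2(dy, dx))
    curvature = np.abs(np.diff(direction))            # change in pen direction

    slow = [i for i in range(1, len(speed) - 1)
            if speed[i] < speed[i - 1] and speed[i] < speed[i + 1]
            and speed[i] < speed.mean()]              # local minima below mean
    sharp = [i for i in range(1, len(curvature) - 1)
             if curvature[i] > curvature[i - 1] and curvature[i] > curvature[i + 1]
             and curvature[i] > curvature.mean()]     # local maxima above mean
    return slow, sharp                                # fed to the hybrid fit
```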

Curved segments are handled after the vertices are found. Sezgin draws attention to the fact that the ratio of the Euclidean distance between two vertices to the actual arc length between them is close to 1 for a line segment and lower for a curve. After determining these ratios, the curved segments are approximated with Bezier curves.
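The test reduces to a chord-to-arc-length ratio; a small sketch follows, with an assumed cutoff value:

```python
# A small sketch of the line-vs-curve test described above; the 0.9 cutoff
# is an assumed illustrative value, not a figure from the paper.
import numpy as np

def is_line_segment(points, threshold=0.9):
    """points: (N, 2) samples between two detected vertices."""
    chord = np.linalg.norm(points[-1] - points[0])              # Euclidean distance
    arc = np.linalg.norm(np.diff(points, axis=0), axis=1).sum() # arc length
    return arc > 0 and chord / arc >= threshold                 # ~1 => line, lower => curve
```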

After the vertices are found and the line and curve segments approximated, beautification follows, and simple recognition is achieved by examining simple properties of geometric objects that can be drawn with lines and curves.

Sezgin identifies future work as incorporating scale-space theory to be able to recognize sketches whose features occur at different scales.


Discussion:


Sezgin's implemented system is well defined in terms of what it does achieve, but at the same time it seems to avoid most of the aspects we consider inherent in freehand sketching and its feeling of naturalness. The idea of creating a hybrid fit from both curvature and speed data, using the intuition that people slow down when drawing corners, sounds excellent as presented in the paper. However, when people sketch, they sometimes do so to design something they do not yet know how to draw; they slow down, and maybe stop, to gather their thoughts about how the sketch should continue. Since this would lower the threshold for the speed data, more candidate vertices would need to be evaluated, and possibly a number of them would end up in the hybrid fit. Another point is that the determination of the error term's upper bound is not discussed thoroughly in the article, although it might also be affected by this behavior.

The mention of handling overtracing at the end of the article also seems inadequate, since the authors consider overtracing only for the beautification process, as the provided examples show. In natural freehand sketching, people use overtracing not only as touch-up strokes to make a drawing clearer but also to put emphasis on a particular object.

Last but not least, the system does not seem to give the user the freedom to create objects with multiple strokes and have them understood, which is very common when sketching with pen and paper.


Citation:


Sezgin, T. M., Stahovich, T., and Davis, R. 2001. Sketch based interfaces: early processing for sketch understanding. In Proceedings of the 2001 Workshop on Perceptive User interfaces (Orlando, Florida, November 15 - 16, 2001). PUI '01, vol. 15. ACM Press, New York, NY, 1-8.

Specifying Gestures by Example

by Rubine, D.


Summary:


Rubine's article is based on an application framework called Gesture Recognizers Automated in a Novel Direct Manipulation Architecture, GRANDMA for short, for creating and manipulating recognizers from examples. After explaining the need for such a framework, namely being able to create "small, fast and accurate recognizers", Rubine goes on to describe various aspects of GRANDMA through an example, GDP, a gesture-based drawing program.

Rubine is interested in creating recognizers for single-stroke gestures, since this avoids the segmentation problem inherent in multi-stroke gestures. Here, a single-stroke gesture is described as pressing a mouse button to initialize, moving the mouse to draw the gesture, and lifting the button or waiting for a certain amount of time to end it. Next, a click-and-drag interface for creating the gestures to be recognized in GDP is presented. The user adds gestures to GDP by simply adding a new class in the interface and then providing examples for training; 15 examples per class are said to be sufficient to reflect variance in size and/or orientation. The semantics of the gesture are described after the class is created, enabling editing and manipulation of the gesture.

After the gesture classes to be recognized are created, the author presents a statistical method for recognition. In this setting, a gesture is nothing but a series of time-stamped points, and the problem is to determine the class to which a gesture belongs, from a set of gesture classes specified by examples. The statistical gesture recognition consists of extracting a vector of features from the input and classifying that vector as one of the possible gesture classes. The author empirically determines 13 features that can be computed at run time. A linear evaluation function is used to classify input gestures: the dot product of the feature vector with each gesture class's weight vector gives the result of that class's evaluation function, and the maximum among these results determines the class to which the input gesture belongs, concluding the recognition. The weight vectors are calculated from the per-class mean feature vectors and the common covariance matrix, obtained by averaging the per-class covariance estimates. Rejection occurs when the results of the evaluation functions of different classes are too close to each other. Evaluation of this system showed over 98% accuracy with 15 classes and 15 or more examples provided per class. Even with different combinations of the number of classes and the number of examples, the accuracy rate does not fall below 90%.
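The classifier has the standard linear-discriminant form; the sketch below follows that form under simplifying assumptions rather than reproducing Rubine's exact training algebra:

```python
# A compact sketch in standard linear-discriminant form (Rubine's exact
# training algebra is in the paper; the example-count weighting of the
# common covariance is simplified here to a plain average).
import numpy as np

def train(class_examples):
    """class_examples: one (N_c, F) array of feature vectors per gesture class."""
    means = [ex.mean(axis=0) for ex in class_examples]
    cov = sum(np.cov(ex, rowvar=False) for ex in class_examples) / len(class_examples)
    inv = np.linalg.pinv(cov)                     # shared (common) covariance
    weights = np.array([inv @ m for m in means])  # one weight vector per class
    biases = np.array([-0.5 * m @ inv @ m for m in means])
    return weights, biases

def classify(features, weights, biases):
    """Evaluate every class's linear function and return the argmax class."""
    scores = weights @ features + biases
    return int(np.argmax(scores)), scores         # scores support rejection checks
```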

The author mentions two extensions to the framework. The first is eager recognition, in which the system recognizes an input gesture as soon as its ambiguity is eliminated. The other is multi-finger recognition, which applies single-stroke recognition to each of the strokes created by the individual fingers and then combines the results to classify the multi-finger gesture.


Discussion:


The GRANDMA framework presented in this article struck me with its simplicity, despite the fact that almost every detail of its features, evaluation functions and classification method is explained in the paper. Adding a new gesture class together with its semantics and training it with very few examples seems trivial. There are, however, some issues related to drawing constraints that caught my attention. A simple example would be the "x" deletion gesture: it is odd that the starting point of the gesture, rather than its intersection point, is what must be placed on top of the previously drawn object to delete it. This may be done for efficiency in storing information and calculating the features. Moving on, drawing some gestures with a single stroke does not seem natural, and as the framework expands to new domains, single-stroke gestures are likely to feel unrelated to the gesture classes they are designed for. This would give the user an unnatural feeling, even granting that users are expected to familiarize themselves with a program before using it.

Another part that made me think was the author's way of selecting features. Since the selection was empirical and the results were satisfying, there is no extensive discussion of the matter. However, extracting features and selecting among them is supposed to be an important part of recognition.


Citation:


Rubine, D. 1991. Specifying gestures by example. In Proceedings of the 18th Annual Conference on Computer Graphics and interactive Techniques SIGGRAPH '91. ACM Press, New York, NY, 329-337. DOI= http://doi.acm.org/10.1145/122718.122753