CSC492 Independent Study - Eye Segmentation Project (UTM)

Researcher: Wenxi Chen

Supervisor: Arnold Rosenbloom


Introduction

This web page reports on the ongoing Eye Segmentation Project. The main goal of the project is to build a program that identifies the pupil and the iris in input eye pictures. The problem was first introduced by an anthropology study that examined the relationship between human genes and irises (in terms of color and pattern). The researcher of that study wanted a program that identifies the iris region automatically (excluding the pupil region). However, the image recognition methods described on this web page can potentially be applied to many areas, such as face recognition and object recognition in images. Moreover, the Machine Learning methods described can potentially be applied to areas outside of Machine Vision, such as Machine Listening.


Background

For this research, I have obtained a large number of standard eye images. Each image is either 3043 x 2036 or 4256 x 2848 pixels and contains a single eye. I aim to achieve the goal of the project by experimenting with conventional image analysis as well as Machine Learning methods. Below are four sample images from the dataset (the images are scaled to be displayed properly):

"2004.jpg"
"2012.jpg"
"2028.jpg"
"2030.jpg"

Progress

Exploration 1: SimpleCV's findBlobs() and Smoothing Functions

SimpleCV was introduced to me by Prof. Rosenbloom as an easy-to-use image processing tool when I first started the project. Among SimpleCV's many functions, the Image.findBlobs() function identifies blobs in a given image. By adjusting the intensity threshold, the function was able to identify the pupil area relatively well in the few images I started with. However, the iris region could not be easily identified by the function.
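For reference, a minimal sketch of this kind of findBlobs() call (SimpleCV 1.x-era API; the threshold and size values below are illustrative guesses, not the settings used in the project):

    from SimpleCV import Image

    img = Image("2004.jpg")                        # one of the sample eye images
    # threshval sets the intensity cut-off used to binarize the image before
    # extracting blobs; minsize drops small speckle blobs.
    blobs = img.findBlobs(threshval=100, minsize=500)
    if blobs:
        blobs.draw(width=3)
        candidate = blobs.sortArea()[-1]           # largest surviving blob
        print("candidate pupil:", candidate.area(), candidate.centroid())
        img.show()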

The function's ability to identify the pupil region was further tested by running it on all of the eye images. (To save time, all testing images were half the width and half the height of the originals.) With unsmoothed images, several factors prevented accurate identification of the pupils, chiefly eyelash noise, the patterns inside the irises, and reflections within the pupils.

In order to remove the noise created by the factors above and to produce more accurate identification, smoothing mechanisms were applied to the images before using the findBlobs() function. Three smoothing mechanisms were tested (all of them built-in SimpleCV functions): 1) the Image.erode() function, 2) the Image.smooth() function (which applies Gaussian smoothing), and 3) the Image.smooth('bilateral') function. Among them, bilateral smoothing produced the best results. However, even bilateral smoothing was not able to remove all of the noise. In fact, only the eyelash noise could be partially removed; the other types of noise were too strong to be affected by the aforementioned smoothing functions.
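A sketch of how the three smoothing mechanisms can be chained in front of findBlobs() (same caveat as above: the parameter values are illustrative only):

    from SimpleCV import Image

    img = Image("2012.jpg")

    eroded    = img.erode(iterations=2)        # 1) morphological erosion
    gaussian  = img.smooth()                   # 2) Gaussian smoothing (the default)
    bilateral = img.smooth('bilateral')        # 3) bilateral smoothing

    # Bilateral smoothing gave the best (still imperfect) input for blob detection.
    blobs = bilateral.findBlobs(threshval=100, minsize=500)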


Exploration 2: Multiply Image with Scalar (in SimpleCV)

Since the pupils are among the darkest regions in the images, one potential way to identify the pupil region is to multiply the image objects by a scalar (in the SimpleCV framework, image objects are 3-d arrays whose innermost dimension is the RGB vector). The multiplication is carried out element-wise: the scalar is multiplied with each value in every RGB vector. As a result, if the scalar is greater than 1, higher RGB values increase much faster than lower ones, while a black pixel is not affected by the multiplication at all. The expectation was that, after an appropriate multiplication, a typical pupil region would stand out from a much lighter background.
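A NumPy sketch of this element-wise scaling (the project did it on SimpleCV image objects; here a plain array stands in for them):

    import numpy as np
    import cv2

    img = cv2.imread("2004.jpg")                     # H x W x 3 array of uint8

    def brighten(image, scalar):
        """Multiply every channel value by `scalar`, clipping at 255.
        Near-zero pupil pixels barely move while brighter iris and sclera
        pixels saturate quickly, so the pupil stands out."""
        scaled = image.astype(np.float32) * scalar
        return np.clip(scaled, 0, 255).astype(np.uint8)

    times2 = brighten(img, 2)
    times4 = brighten(img, 4)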

When carried out, this method produced better results for pupil identification than SimpleCV's built-in smoothing functions. After the multiplication, the noise from the iris pattern and the eyelashes was almost completely removed, and the irises became much more distinguishable from the pupils. Although the reflections in the pupil region intensified, as long as the reflections were not located at the edge of the region (which was the case for all images in the dataset), the method made the edge stand out, thus helping to identify the pupil region.

"2004.jpg" multiplied by 2
"2004.jpg" multiplied by 4

However, one of the major difficulties in implementing the method was choosing the right scalar to multiply by. Since the difference in RGB values between the iris region and the pupil region varied from eye to eye, different images could require different multipliers to produce effective results. For example, multiplying by 4 made "2004.jpg" a good candidate for pupil identification using SimpleCV's findBlobs() function, but not "2012.jpg". Automating the choice of an effective multiplier for different input images should be explored further in order to make the method practical.


Exploration 3: Histogram Mode Separation (Global Thresholding)

A common practice for image segmentation is thresholding with Otsu's method or other methods that utilize a global threshold. However, this approach was rejected after some experimentation, because the histograms (grayscale and RGB) of some images have only one big mode (such as the histograms shown below).

Histograms of "2004.jpg"
Intensity Histogram
Histogram of Red Values
Histogram of Green Values
Histogram of Blue Values
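For completeness, this is how the rejected global-threshold approach looks with OpenCV's Otsu implementation; with a single dominant histogram mode the chosen threshold is not meaningful:

    import cv2

    gray = cv2.imread("2004.jpg", cv2.IMREAD_GRAYSCALE)
    # Otsu picks one global threshold from the histogram; with one big mode,
    # the returned thresh_value splits that mode at an arbitrary point.
    thresh_value, binary = cv2.threshold(gray, 0, 255,
                                         cv2.THRESH_BINARY + cv2.THRESH_OTSU)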


Exploration 4: plotting RGB values and intensity across the middle line

In order to gain more insight into the task, I plotted the RGB values and the intensity along the middle line (from left to right) of each image. The middle line was found to cross both the pupil region and the iris region in all images in the dataset. Below are typical plots for two images from the dataset.

Plots of "2004.jpg"
Plots of "2012.jpg"
mid-line intensity of "2004.jpg"
mid-line RGB values of "2004.jpg"
mid-line intensity of "2012.jpg"
mid-line RGB values of "2012.jpg"

The intensity plot is placed on the left of each pair, and the RGB-value plot on the right. While the majority of images produced intensity plots like the one for "2004.jpg", some images with dark irises produced plots similar to the one for "2012.jpg". The main difference between these two intensity plots is the change in intensity from the iris region to the pupil region. As shown, the intensity plot of "2004.jpg" shows a rapid decrease where the line passes from the iris region into the pupil region (around x=350) and a rapid increase where it passes from the pupil region back into the iris region (around x=480). In the intensity plot of "2012.jpg", however, this change is not as apparent.

What about the RGB values? In the RGB plots, the color of each line represents the respective channel (the red line represents R-values, the green line G-values, and the blue line B-values). It was found that while the Green and Blue values had inconsistent patterns across the images, the Red values were relatively consistent: they always changed apparently where the line crosses between the iris region and the pupil region, in a way similar to the intensity plot of "2004.jpg".
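A short sketch of how such mid-line profiles can be produced (matplotlib/OpenCV; not the exact plotting code used for the figures above):

    import cv2
    import matplotlib.pyplot as plt

    img = cv2.imread("2004.jpg")                    # OpenCV loads channels as B, G, R
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    mid_row = img.shape[0] // 2                     # the middle line, left to right

    plt.figure()
    plt.plot(gray[mid_row, :], color='k')
    plt.title("mid-line intensity")

    plt.figure()
    for channel, color in zip((2, 1, 0), ('r', 'g', 'b')):   # reorder B,G,R to R,G,B
        plt.plot(img[mid_row, :, channel], color=color)
    plt.title("mid-line RGB values")
    plt.show()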

Upon finding this potential pattern, thresholding on only the Red values was tested.

Exploration 4.1: thresholding on the Red values

Since the Red values are more consistent than the other channels, thresholding on the Red values alone was tested with SimpleCV's findBlobs() function. Unfortunately, the results were not satisfactory. The problem remained that different images need different threshold values; that is, no single value successfully thresholded all images, and SimpleCV's built-in adaptive method was not powerful enough to find the right threshold for each image. Thus, simple thresholding on the Red values was rejected, and more powerful algorithms are needed for identifying the pupil region and the iris region.
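A minimal illustration of thresholding the red channel alone (the cut-off value is a placeholder; as noted above, no single value worked for every image):

    import cv2

    img = cv2.imread("2012.jpg")
    red = img[:, :, 2]                     # OpenCV stores channels as B, G, R

    # Pixels whose red value falls below the cut-off are kept as pupil candidates.
    _, pupil_mask = cv2.threshold(red, 60, 255, cv2.THRESH_BINARY_INV)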


Exploration 5: Edge Explorer (to be Cont.)

Edge Explorer is an idea I have for pupil identification. It is a specific algorithm for detecting the edges of eye pupils in images. Assuming a pixel is found to be an edge pixel of the pupil, the algorithm works like a little robot, finding its way around the pupil region and producing a circular contour, which would be the edge of the pupil. The algorithm leverages the fact that every edge pixel has two neighboring pixels that are also edge pixels.

Unlike traditional edge detection algorithms, Edge Explorer's application is more limited. However, Edge Explorer has two advantages over traditional generic algorithms in the specific case of pupil edge detection. First, Edge Explorer can connect gaps between edge segments caused by noise (e.g., eyelash reflections). Second, and more importantly, Edge Explorer is able to detect the edge between less differentiated regions (e.g., a pupil region and a dark iris region).

How does it work?

Edge Explorer has four motors, each of which carries the explorer in a different direction. Edge Explorer is set to move only clockwise, so the direction of its movement depends on where it currently is. The possible positions of Edge Explorer are divided into four quadrants, like the quadrants of Cartesian coordinates, and in each quadrant Edge Explorer has a unique set of directions it can move toward.
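A rough sketch of the idea, with the caveat that the quadrant rules, the gradient test and all helper names below are my own illustrative assumptions; the actual algorithm was never finished:

    import numpy as np

    def quadrant(y, x, cy, cx):
        """Which quadrant of the pupil (centred at cy, cx) the pixel lies in.
        Image coordinates: y grows downward."""
        if x >= cx:
            return 1 if y < cy else 2      # upper-right / lower-right
        return 3 if y >= cy else 4         # lower-left / upper-left

    # (dy, dx) steps that keep the walk moving clockwise on screen.
    CLOCKWISE_STEPS = {
        1: [(0, 1), (1, 1), (1, 0)],       # 12 to 3 o'clock: right, then down
        2: [(1, 0), (1, -1), (0, -1)],     # 3 to 6 o'clock: down, then left
        3: [(0, -1), (-1, -1), (-1, 0)],   # 6 to 9 o'clock: left, then up
        4: [(-1, 0), (-1, 1), (0, 1)],     # 9 to 12 o'clock: up, then right
    }

    def explore_edge(grad_mag, start, center, max_steps=5000):
        """Walk clockwise from a known edge pixel, always stepping to the
        allowed neighbour with the strongest gradient magnitude, until the
        contour closes on itself. Assumes the pupil sits away from the image
        border and grad_mag is, e.g., a Sobel gradient magnitude image."""
        cy, cx = center
        y, x = start
        path = [start]
        for _ in range(max_steps):
            steps = CLOCKWISE_STEPS[quadrant(y, x, cy, cx)]
            y, x = max(((y + dy, x + dx) for dy, dx in steps),
                       key=lambda p: grad_mag[p])
            path.append((y, x))
            if (y, x) == start:            # contour closed
                break
        return path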

Edge Explorer's algorithm is not completed yet; however, I have decided to shelve the idea for now so that I have time to explore some other possibilities.


Exploration 6: Edge Detections

Several traditional edge detection algorithms were tested using SimpleCV. The first algorithm I tried was SimpleCV's built-in Sobel edge detection. The results were unreadable: the real edges were only partially identified while noise pervaded the output.

Although generally believed to be a superior algorithm, SimpleCV's Canny edge detection produced results similar to Sobel's. It is worth noting that it was very difficult to find the correct values for the two thresholds the function needs.

The Hough Transform was then tested. I started with OpenCV's cv2.HoughCircles() function, which uses the "Two-Stage Hough Transform" method. However, this function did not work well. After some research online, I decided to test some of MATLAB's edge detection functions.
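For reference, a typical cv2.HoughCircles() call looks like this (OpenCV 3+ constant name; all parameter values below are guesses for a downscaled eye image, and in my tests no setting worked reliably):

    import cv2

    gray = cv2.imread("2004.jpg", cv2.IMREAD_GRAYSCALE)
    gray = cv2.medianBlur(gray, 5)

    # param1 is the upper Canny threshold, param2 the accumulator threshold.
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=2, minDist=500,
                               param1=100, param2=50,
                               minRadius=100, maxRadius=600)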


Potential Iris Identification Solution: Hough Transformation (MATLAB)

While the edge detection algorithms in MATLAB produced noisy results similar to the ones in SimpleCV, MATLAB's imfindcircles() function generated promising results for iris identification. Using imfindcircles() on scaled grayscale images identified the iris region in 87% (389/447) of the data images.

By examining the 13% of data images in which the imfindcircles() function was not able to identify the iris region, it was found that a great portion of them were images of light blue, hazel or green eyes, such as images "2028.jpg" and "2030.jpg" in the samples above. Thus, the failure in identification was likely caused by a lack of distinction between the iris region and the white region outside the iris.

A preprocessing algorithm was introduced to increase the distinction. Because light colors have RGB values close to 255, a good way to increase the distinction is to reverse the RGB values by subtracting each of them from 255, then multiply the result by a scalar greater than 1. For example, two pixels, (249, 250, 247) and (200, 232, 211), after the described transformation (e.g., using 2 as the multiplier) become ((255-249)*2, (255-250)*2, (255-247)*2) = (12, 10, 16) and ((255-200)*2, (255-232)*2, (255-211)*2) = (110, 46, 88). As a result, the difference between these two pixels is increased by a factor of 2, helping the Hough transform distinguish them better. (Note that converting the image to grayscale before the preprocessing slightly decreases the computational cost.)
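A NumPy sketch of this preprocessing, together with the increasing-multiplier loop described in the next paragraph. The project itself used MATLAB's imfindcircles(); cv2.HoughCircles() is only a stand-in here, and every parameter value is an illustrative guess:

    import numpy as np
    import cv2

    def invert_and_scale(gray, scalar):
        """Reverse intensities (255 - value), then multiply by `scalar`, so the
        originally light sclera stays near 0 while iris and pupil brighten."""
        flipped = 255.0 - gray.astype(np.float32)
        return np.clip(flipped * scalar, 0, 255).astype(np.uint8)

    def find_iris(gray):
        # Stand-in circle finder (the project used MATLAB's imfindcircles()).
        circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=2, minDist=500,
                                   param1=100, param2=60,
                                   minRadius=150, maxRadius=600)
        return None if circles is None else circles[0][0]   # (x, y, radius)

    gray = cv2.imread("2028.jpg", cv2.IMREAD_GRAYSCALE)
    iris = find_iris(gray)
    scalar = 1.5
    while iris is None and scalar <= 4.0:      # gradually increase the multiplier
        iris = find_iris(invert_and_scale(gray, scalar))
        scalar += 0.5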

By applying the preprocessing method described above in a loop with a gradually increasing multiplier, the imfindcircles() function was able to successfully identify the iris region in 96% (431/447) of the data images. Below are images "2012.jpg" and "2028.jpg" with the iris region identified.

"2012.jpg" with the iris region identified
"2028.jpg" with the iris region identified

However, for some images, like "2012.jpg", the identified iris region does not lie exactly on the edge. There are a few causes of this. One cause is the noise created by the eyelashes, as in "2012.jpg". Another cause is the shape of the irises: because irises are hardly ever perfect circles, there will always be some discrepancy between the circle identified by the Hough transform and the actual iris edge. For both causes, some correction mechanism, possibly using Machine Learning, may be needed to further correct the edge identified by the imfindcircles() function. We should come back to this later.


Exploration 7: k-means Clustering

After achieving some success in iris identification with the Hough transform, I further experimented with that algorithm, as well as the previously mentioned edge detection algorithms, for identification of the pupil region. However, success was very limited. The imfindcircles() function was unable to identify most of the pupils even after I cropped out the area outside the iris. As a result, I decided to explore Machine Learning algorithms.

The k-means clustering algorithm was the first to be tested. Because of the difference in color between the iris region and the pupil region, clustering seemed like an easy shortcut to capture the difference and classify the two regions. Even in the darkest eye there is an apparent difference between the two regions, so in the ideal situation k-means should be able to separate them. Below are the results of running k-means with k=3 on images "2004.jpg" and "2012.jpg" (only pixels around the lowest centroid are shown).

Result from k-means on "2004.jpg"
Result from k-means on "2012.jpg"

As shown, while the k-means algorithm produced nearly perfect output for "2004.jpg", the output for "2012.jpg" contains noise. Since "2012.jpg" is an image of a brown eye, the low degree of distinction between the iris and the pupil region was suspected to be the culprit, which was backed up by experimental results from more images. However, in the result for "2012.jpg", the difference between the noise in the iris region and the actual pupil pixels is still apparent.

To capture this difference and correct the results from the first k-means pass, k-means was applied a second time with k=2. The binarized sample results are shown below.

Result from second k-means on "2004.jpg"
Result from second k-means on "2012.jpg"

As shown, after applying k-means a second time, there is no more noise in the result for "2012.jpg". This was the case for most of the data images, and for the results that still contained noise, the noise was negligible.

Given the binarized results from the double k-means algorithm, some further manipulation should be enough to produce good approximations of the pupil regions in the input images. Therefore, the double k-means algorithm could be a potential solution for pupil identification.

Implementation Detail of the k-means Algorithm

The input to the k-means algorithm was a list of pixels from the already identified iris region. Since the pixels are RGB pixels, the clustering was done in a 3-d space. Furthermore, in order to reduce computational cost, only selected pixels were used as input instead of all pixels in the identified iris region: the pixels lying on the radii at 0, 45, 90, 135, 180, 225, 270 and 315 degrees (measured from the radius pointing to the 12 o'clock position) of the identified circular iris region. As a result, the number of pixels was significantly reduced while the selected ones approximated pixels from all directions.
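A sketch of the sampling and the two k-means passes (scikit-learn; the circle parameters and file name below are placeholders standing in for the iris circle found earlier):

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    img = cv2.imread("2004.jpg")[:, :, ::-1]        # convert B,G,R to R,G,B
    cx, cy, r = 1500, 1000, 600                     # placeholder iris circle

    def sample_radii(img, cx, cy, r, angles_deg=range(0, 360, 45)):
        """Collect RGB pixels along the eight radii of the identified iris circle
        (0 degrees points to 12 o'clock; y grows downward in image coordinates)."""
        samples = []
        for deg in angles_deg:
            theta = np.deg2rad(deg)
            for step in range(int(r)):
                x = int(round(cx + step * np.sin(theta)))
                y = int(round(cy - step * np.cos(theta)))
                samples.append(img[y, x])
        return np.array(samples, dtype=float)

    pixels = sample_radii(img, cx, cy, r)

    # First pass: k=3; the cluster with the darkest centroid is the pupil candidate.
    km1 = KMeans(n_clusters=3).fit(pixels)
    darkest = np.argmin(km1.cluster_centers_.sum(axis=1))
    pupil_candidates = pixels[km1.labels_ == darkest]

    # Second pass: k=2 on the candidates removes most of the remaining iris noise.
    km2 = KMeans(n_clusters=2).fit(pupil_candidates)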

Exploration 7.1: "blue-diff"

Even though applying k-means a second time removed the noise successfully, it also cleared out some pupil pixels that were present in the results of the first k-means pass. This can be observed by comparing the first-pass and second-pass k-means results of both "2004.jpg" and "2012.jpg" above: the blueish rim of pupil pixels that appears in the first-pass results becomes thinner in the second-pass results.

Since there was an apparent color difference between the pupil pixels and the iris pixels in the first-pass k-means results, I suspected there was a better way to capture the pupil pixels than applying k-means a second time. I therefore plotted the RGB values along the middle line (similar to the plots in Exploration 4) of the images resulting from the first k-means pass, in order to find some pattern. Below are examples of such plots for "2004.jpg" and "2012.jpg".

Plot for "2004.jpg"
Plot for "2012.jpg"
mid-line RGB values of "2004.jpg"
mid-line RGB values of "2012.jpg"

As shown in the plots above, among the irregular ups and downs of the RGB values, some areas have the blue line on top, the green line in the middle and the red line at the bottom. After comparing these areas with the corresponding pixels in the images, it turned out that this particular pattern of RGB values represents the blueish pupil pixels. Therefore, in order to extract these blueish pupil pixels, for each pixel I calculated the variable "blue-diff" (i.e., blue-diff = (Blue_value - Red_value) + (Blue_value - Green_value)). Then I binarized the images resulting from the first k-means pass using the sign of the "blue-diff" value (i.e., a pixel becomes 0 if its "blue-diff" is less than or equal to 0, and 1 otherwise). Below are the "blue-diffed" images of "2004.jpg" and "2012.jpg".

"2004.jpg" blue-diffed
"2012.jpg" blue-diffed

Although there is more noise in the "blue-diffed" images than in the results of applying k-means twice, many more pupil pixels are preserved in the former. Moreover, as in the two images above, the noise in the "blue-diffed" images is weak and scattered, and is therefore unlikely to impede future implementation.

A potential question is whether the "blue-diff" method could be applied directly to the raw images. The answer is no: because the RGB pattern used by the "blue-diff" method also appears in some iris pixels, applying "blue-diff" without clustering first would produce heavy noise in some images. Satisfactory results could only be produced when the k-means clustering and the "blue-diff" method were used together.


Exploration 8: Neural Network Classification

Currently, I have been experimenting with pupil identification using a Neural Network classification algorithm (a Multilayer Perceptron trained with backpropagation). The idea is to use cropped, preprocessed images of the previously identified iris regions (covering both the iris and the pupil; the preprocessing is similar to the one described for the Hough transform method) as input. The output is a binary image of the same size in which the identified pupil pixels have value 1 and all other pixels have value 0. The algorithm learns to identify the pupil pixels inside the images through training. Since it is a classification algorithm, it uses logistic regression's cost function with regularization.

The first architecture I tested consisted of the following: an input layer of size 80x80 (I started experimenting with size 400x400, but the computation was extremely slow, especially when the hidden layer size was greater than 50), a hidden layer whose size was left undetermined (I was looking for a hidden layer size that produces optimal performance), and an output layer of size 80x80. Below are four sample outputs masked onto the corresponding images in the training set (the outputs were predicted using the thetas derived at gradient descent iteration 200 in the training run with training_size=100, hidden_layer_size=200, and lambda=0):

"2004.jpg" with the pupil region masked
"2012.jpg" with the pupil region masked
"2028.jpg" with the pupil region masked
"2053.jpg" with the pupil region masked

The overall results were not satisfactory. As shown in the samples above, the masks for images "2004.jpg", "2028.jpg" and "2053.jpg" are too big. This was the case for 34% (34/100) of the results in the training set. As a result, some analysis was done in order to identify ways to improve the predictions, if possible.

Training Set Cost Comparison at Iteration 50

As shown in the chart above, the cost generally decreases as the size of the training set increases or as the hidden layer size increases. Thus, increasing the training set size and the hidden layer size will likely improve the accuracy of the predictions.

Cross Validation Set Cost Comparison Across Different Lambda Values with Hidden Layer Size 50 at Iteration 50
training set size 10 (size=10)
training set size 30 (size=30)
training set size 50 (size=50)
training set size 100 (size=100)

The comparison charts above show that the lambda values producing the minimum costs in cross validation are 0.32 for size=10, 0.02 for size=30, and 0 for size=50 and size=100. Given that these costs were derived with a hidden layer size of 50, I suspect that with training set size 100 the hidden layer size can safely be increased further without much regularization. However, a larger hidden layer size with lambda greater than 0 is generally beneficial, and it also proved beneficial in the cases of size=10 and size=30.

Exploration 8.1: Including the "blue-diffed" Data in the Input

An increase in input size is generally beneficial as long as an appropriate regularization term is used to prevent over-fitting. Including the "blue-diffed" data (see Exploration 7.1) in the input only doubles the input size (I cropped the "blue-diffed" images in the same way as the original input images, so they also have size 80x80). As a result, the new input data has size 12800. The additional input should give a much clearer indication of where the pupil region is. Ideally, the program will learn to combine the "blue-diffed" image with the original preprocessed image, filter out the noise and fill in the missing pupil pixels. Below are two plots of the training cost over the first 1000 iterations (training_size=100, hidden_layer_size=2000, lambda=0): one with the "blue-diffed" data and one without.

Comparing the two plots above, it seems the "blue-diffed" data has helped the Neural Network perform better. The cost without the "blue-diffed" data flattens out after the first few iterations and stays above 200 with minimal decrease over further iterations, whereas the cost with the "blue-diffed" data continues to decrease after the first few iterations at a noticeably faster rate. Therefore, input data including the "blue-diffed" data is preferred. (Input including the "blue-diffed" data is the default input in the later sections, unless stated otherwise.)

Exploration 8.2: Two Hidden Layers

Besides increasing the hidden layer size, increasing the number of hidden layers can potentially improve the performance of a Neural Network. However, there is no standard formula for when increasing the number of hidden layers is better than increasing the size of a single hidden layer; the general guidance I found on this matter was experience and experimentation. Thus, I implemented a design with two hidden layers to explore this possibility.

The design was largely similar to the previous one-hidden-layer design (including the additional "blue-diffed" input); the only differences were one extra hidden layer and changes in the hidden layer sizes. I set the first hidden layer to size 3000 and the second to size 1000. If increasing the number of hidden layers could improve the Neural Network's performance, I should have seen improvement with this design. However, that was not the case. I experimented with the design several times with random initializations. Below is a typical plot of the training costs over the first 500 iterations.

The plot above shows performance similar to the one-hidden-layer design with "blue-diffed" data in Exploration 8.1; the improvement is not apparent. Considering the significant increase in hidden layer size (from 1000 to 3000, plus one extra layer of 1000), this result indicates that increasing the number of hidden layers was not effective for the current Neural Network.

Exploration 8.3: Standardize the input

Even though standardizing the input data is not strictly necessary for an MLP, since any scaling can be canceled out by changing the corresponding weights, "standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima", according to the Neural Network FAQ. As a result, I decided to try standardizing the input.

I experimented with two methods of standardizing the input data. Method One standardized the original input data to have mean=0 and standard_deviation=1, using the formula: standardized_data = (original_data - mean) / standard_deviation. This is a relatively popular way to standardize input data for neural networks.

Method Two leveraged a particular characteristic of the input data. Since the basic input data (excluding the "blue-diffed" data) was preprocessed to increase the contrast between non-pupil pixels and pupil pixels, more specifically to make bright pixels much brighter, 14% to 86% of each image's pixel intensities were increased to 255 (completely white). Below are two examples of the preprocessed input. As shown, a large portion of the completely white pixels is not necessary for identifying the pupil pixels. Therefore, it would be beneficial to identify the useless pixels and remove them from the input data. However, identifying and removing useless pixels is not an easy task (I might try to do this with pruning in future experiments). An easier way to simulate this is to set all white pixels to 0 and let the Neural Network learn to ignore the useless ones.

preprocessed "2004.jpg" input
preprocessed "2012.jpg" input

Following this idea, the basic input data was reversed with the formula: standardized_basic_data = -(original_basic_data - 255). The reversed basic input data was then combined with the "blue-diffed" data (although the "blue-diffed" data is supposed to be logical, its values are 255 and 0 rather than 1 and 0). Finally, the entire input data was standardized to the range [0, 1] by dividing it by 255.

Above are the two plots of the costs over the first 200 iterations using the two methods (sample_size=100, hidden_layer_size=2000, lambda=0). As shown, both standardizations improved the performance of the Neural Network: the costs decrease more rapidly than in all previous attempts. Moreover, I noticed that the computation time for each iteration decreased when the input was standardized by either method, making future attempts with a larger hidden layer size possible (assuming the same computation power). Of the two methods, Method Two improved the performance more than Method One: the cost using Method Two decreases below 1 after 150 iterations, whereas the cost using Method One remains above 10 after 200 iterations.
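For clarity, a NumPy sketch of the two standardization methods (the helper names are mine):

    import numpy as np

    def standardize_method_one(x):
        """Zero-mean, unit-variance standardization."""
        return (x - x.mean()) / x.std()

    def standardize_method_two(basic, blue_diffed):
        """Flip the preprocessed intensities so the (mostly useless) white
        pixels become 0, append the blue-diffed data, then scale to [0, 1]."""
        reversed_basic = -(basic.astype(np.float32) - 255.0)       # 255 - value
        combined = np.concatenate([reversed_basic.ravel(),
                                   blue_diffed.astype(np.float32).ravel()])
        return combined / 255.0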

Exploration 8.4: Prediction over the Training Set

The images below are the new sample outputs masked onto the corresponding images, using the "blue-diffed" data in the input and Method Two for standardization. The outputs were predicted using the thetas derived at gradient descent iteration 200 in the training run with training_size=100, hidden_layer_size=2000, and lambda=0. At iteration 200, the cost is less than 0.12.

"2004.jpg" with the pupil region masked
"2012.jpg" with the pupil region masked
"2028.jpg" with the pupil region masked
"2053.jpg" with the pupil region masked

Comparing the masked images above with the four predictions shown near the beginning of Exploration 8, the improvement in prediction accuracy is significant: all pupil regions are masked almost perfectly. At 0.12, the cost is already small, and it is important to note that some of the inaccuracy in the masks is due to the way the labeled training data was created by hand. Furthermore, if gradient descent continued to iterate, the cost would approach zero; eventually the algorithm converges in fewer than 2000 iterations, at which point the cost is insignificant (about 1e-14).

Some Implementation Details
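The cost function being minimized, written out in the notation defined below (this is the standard regularized cross-entropy cost from the Coursera course):

    J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2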

Above is the cost function for the Neural Network classification. The function is taken from Prof. Ng's Machine Learning course on Coursera. "m" is the number of examples in the input; "K" is the size of the output; "h(x)" is the predicted output; "L" is the number of layers; "s_l" is the size of layer l; "Theta^(l)_{j,i}" is the weight from element i in layer l to element j in layer l+1; "lambda" is the regularization parameter. In the current algorithm, each element in layer l is used to compute every element in layer l+1, and a bias element with value 1 is added to layer l when computing layer l+1.

To be Cont...

As the research continues, I will keep exploring the following:


Random Thoughts


Language


Tool/Library


General Reference