Applications and Terminology

Introduction to the Signal and Image Processing course, including basic concepts, terminology, and applications.

Reference: Lecture from Kim Steenstrup Pedersen

Intro

How do we represent signals?

What is a signal?

Sound example

Here is an example of a sound signal. It is in stereo, and it shows a structured set of data with temporal causality: there are dependencies among the data points across time.

image-20230208124327258

It is structured data because we cannot shuffle the order of these data points arbitrarily without destroying the actual meaning of the signal.

In this course, we focus on images, which, contrary to sound, are spatially structured data.

Images in color or grayscale

In a color image, nearby points are dependent; for example, the points on the amplifier belong together because they lie on the same object. (Each point is correlated with its neighbors.)

image-20230208125229625

In grayscale, the difference is that at every location we only have an intensity value instead of a representation of the color at that location.

Greyscale

A grayscale image is a single-channel image in which each pixel has one gray value that represents its brightness. The gray value is usually an integer between 0 and 255; the larger the value, the brighter the pixel. Grayscale images are widely used in image processing because they are simpler than color images and easier to work with.

Video is a temporal sequence of images

Images can also change over time, and in that case we normally refer to this as a video. A video is a temporal sequence of images; in video lingo, these images are called frames.

There is also a temporal dependency: a point in a frame depends on its neighbors both in space and across time, so there are dependencies or correlations between the data elements.

A video is a dynamic image composed of a number of static images (frames). Each pixel in a frame depends not only on its spatial position but also on time. As time passes, objects in the image may change position, shape, or size, so video processing has to take into account how pixels vary over time.

Images of various types

Multispectral images

An image from astronomy: it comes from the Sloan Digital Sky Survey.

image-20230208130628106

In reality, at every location in this image, we have five measurements of the color of the light that hits the telescope.

For illustration purposes, these five channels have been converted into three color values that we can show on the screen so that we get an idea of the color.

This is an example of a multispectral image. What we have seen up to now was either color images, which, as you will see later, can be represented with three values per location, or grayscale images, where we only had one value per location. Here we have five values.

The five values represent the distribution of light energy over wavelengths of the light that hits the telescope, and their spectral coverage is illustrated here. => $5$ bands.

The human eye is usually more sensitive to color than to gray levels, so converting a multispectral image into a color image the eye can interpret is common practice. By converting the five bands of the multispectral image into three color values, we can display the information in the image more intuitively and extract features more easily. (The range of colors the eye can recognize is narrow and is built from red, green, and blue; bands such as u and z fall outside this range, so when converting a multispectral image to a color image they are usually mapped onto red, green, and blue.) In addition, converting a multispectral image to a color image reduces the amount of data, which improves processing speed and efficiency. In other words, every point in the SDSS image is really five measurements, but because of the limitations of the human eye there is no need to display all five; three color values per point are enough for visualization.

Hyperspectral images

It means that we have many bands. Here is an example of a Near-infrared image of wheat grains. At every location, we have 224 different measurements.

image-20230208140603639
In a hyperspectral image, each point may have measurements at many wavelengths. For example, in this image every point has 224 channels at different wavelengths. These measurements capture the information in the image more precisely than three color values alone, so a hyperspectral image provides far more information than an ordinary color image.

Wavelength refers to the length of an electromagnetic wave, which determines the color of light; in optics, different colors are usually described by their wavelengths. A channel is an image-processing term for a single component of an image; in a spectral image, a channel holds the measurement at one particular wavelength band. In a multispectral image, each point corresponds to a few channels, each for a different wavelength band; in a hyperspectral image, the number of channels can be very large, so each point carries much richer information.

If we consider a specific location in one of the images on the left, we get an estimate of the spectral content: how much energy we have at different wavelengths in the light that hits the camera from that location.

This shows we still have the spatial structure in the image but we can have multiple measurements at every location. Color is one example, and hyperspectral is another example.

Kinect depth image

An image does not have to measure the intensity of light.

We can also represent images that don’t measure light but could measure other quantities. Here is an example of using a depth camera.

image-20230208142632705

What this image represents is not the amount of light that hits the camera, but instead the depth or distance from the camera to a point out in the scene that the camera is taking a picture of.

In the depth map, the color coding is that light gray means we are close to the camera, and darker values mean we are further away.

Xbox 360 Kinect: a stereoscopic-style measurement of the scene is used to estimate depth. The device also includes a standard RGB color camera, which produced the image on the right.

The point is that in the first couple of examples every location held a measurement related to the amount of light that hit the camera, but we can also talk about images that represent other quantities, for instance a depth map.

3D CT - X-rays

We can also construct images that are 3D by nature. Here is an example of a 3D CT image of a piece of porous rock.

image-20230208143630478

So that means, the image is no longer a flat thing where we have some measurement at every location. We have a volume of measurements. Concretely, these measurements will represent X-ray attenuation at the specific location.

The darker the value, the lower the X-ray attenuation, and the more the material resembles air.

A small piece of the rock has been cut away, so what we are looking at here is the inside of the image volume.

The goal was to figure out how water would flow through a piece of rock; for that, you need to understand all the microscopic holes in the material.

Medical

We can also use this in a medical setting and this leads to medical image analysis.

image-20230208145214893

CT scan of the human torso. From the scanner we can take cuts through the volume and look at each of them as a slice through the torso, treating each slice as an image if we like. But in reality it is a 3D structure, so we can start to do things like, for instance, segmenting out the lungs (the blue net-like structure) and the airways (the yellow points). So we can take image data like the black part on the right and identify structures inside the volume.

We can use 3D CT to reconstruct the real structure by virtual slices.

However, X-rays are mainly suited to hard structures like bone, and less so to soft tissue like fat, muscle, and brain. For those, we have another imaging modality.

3D MRI

image-20230208150130372

The difference from what we saw before is a change of modality: whereas in the X-ray case we measure the amount of radiation that hits the scanner's sensing element, in this case we consider the change of magnetic spins of the atoms inside the material we are imaging.

We are measuring something different than the amount of light. We’re measuring something to do with magnetism.

3D DTI

3D image and each point (voxel) value is a flow tensor.

Another example of images where each of the locations does not necessarily pertain to a measurement of light.

image-20230208150746741

It is pretty difficult to visualize this because it is a 3D image, but instead of having only one value at every location, we have a tensor, which you can think of as a generalized matrix. It says something about the flow of fluids in the material. What you see here is the result of using such an image to identify signal pathways in a human brain.

It is difficult to show the raw data, but we can show the kind of products you can derive from such an image.

Image processing versus image analysis

What is the difference?

image-20230208153731554

Image processing covers all algorithms that take an image as input, do some alterations to it, and spit out a new image as output.

Image in, image out.

In image analysis, we consider a collection of algorithms that take as input an image, or maybe even multiple images, but instead of spitting out a new image, they spit out some form of interpretation of what we see in the image.

Image processing is an essential part of analysis; it is usually something you need to do as the initial step of analyzing what you see in the image.

By interpretation we mean, for instance, in the medical setting, that we want to identify the locations of tumors in the lung in order to diagnose lung cancer. In that case, we would take a CT image (a 3D image) as input and then produce a map of cancer locations, or just spit out a number that gives the percentage risk of developing lung cancer within the next year, and so on.

Example of Image Processing

Image enhancement by histogram equalization.

image-20230208154116998

We have a dark image, and we would like to perform an operation that produces a brighter version of the image. (A global operation.)

Histogram equalization has many variations, e.g. adaptive histogram equalization, but the point is the same: we take either of these images as input and spit out a new, improved or enhanced image.
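
As an illustration (not necessarily the exact implementation used in the lecture), here is a minimal pure-NumPy sketch of global histogram equalization, assuming an 8-bit grayscale input; the synthetic `dark` image is just for demonstration.

```python
import numpy as np

def equalize_histogram(img):
    """Histogram equalization for an 8-bit grayscale image (sketch).

    Maps intensities through the normalized cumulative histogram so that
    the output intensities are spread more evenly over 0..255.
    """
    hist = np.bincount(img.ravel(), minlength=256)   # intensity counts
    cdf = np.cumsum(hist).astype(np.float64)
    cdf /= cdf[-1]                                   # normalize to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)       # lookup table
    return lut[img]                                  # image in, image out

# Example: a synthetic dark image becomes brighter and higher contrast.
dark = (np.random.rand(64, 64) * 60).astype(np.uint8)   # values only 0..59
enhanced = equalize_histogram(dark)
```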

Example of processing and analysis

Image Segmentation

The task: we are given an image, and we would like to identify regions in this image that belong together.

The high-level goal is to identify regions that belong to the same semantic object, something that you would attach a meaning to.

image-20230208154543518

For instance, we have a mountain in the background. We might think of that as one region, where we want to say that all locations on the mountain belong together; they form a segment. => (Segmentation task)


You can do clever things where you take an image of a llama on a mountain background and combine an automatic algorithm with input from a user. In this specific case, the graph cut algorithm, we mark some areas and say that the red strokes belong to the background segment and the white strokes belong to the foreground (the llama). An automatic algorithm then uses these marks plus the image as input and generates a nice cut-out of the llama, with the mountain background removed.

Segmentation is actually a combination of both processing and analysis.
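
As a hedged illustration of this kind of interactive, graph-cut-based segmentation (a related method, not necessarily the exact algorithm from the slide), OpenCV's GrabCut can be initialized with a user-supplied rectangle instead of strokes; the file name and rectangle below are hypothetical.

```python
import numpy as np
import cv2  # OpenCV

# Load an image (hypothetical file name) and define a rectangle that
# roughly contains the foreground object, e.g. the llama.
img = cv2.imread("llama.jpg")
rect = (50, 50, 300, 400)                  # (x, y, width, height), assumed

mask = np.zeros(img.shape[:2], np.uint8)   # per-pixel labels
bgd_model = np.zeros((1, 65), np.float64)  # internal model state
fgd_model = np.zeros((1, 65), np.float64)

# Graph-cut based optimization: pixels inside the rectangle are
# iteratively relabeled as foreground or background.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep only (probable) foreground pixels; the background becomes black.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
cutout = img * fg[:, :, None]
```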

Example of image analysis

Computer vision is concerned with problems that arise from images taken with what you would normally think of as a camera.

You could have problems like object recognition and detection. One problem could be: here is an image, please tell me which objects you see in it.

image-20230208160015673

Representation

Basic terminology and representations

Sound

image-20230208160524423

We already know that a sound signal is a time sequence of data with temporal causality, so we cannot change the order of these pieces of data.

If we take a mono signal, we only have one wave like this. The wave represents the oscillations the membrane in your speaker would have to make to generate the corresponding sound so that your ears can hear it.

But when we digitize these signals, for instance with a microphone, we pass the signal through an analog-to-digital converter, which cuts the signal up in time. In this way, we generate a list of scalar quantities. Each of these quantities represents how big an oscillation we have at that particular time.

So time goes along the horizontal axis, and the y-axis represents the oscillations of the membrane if we want to play back the sound.

Each element of this list of scalars is called a sample, and collectively they are the samples.

A segment of the signal is cut out as one part, and from it we produce a list of scalars; each value in the list is one sample, which is why the list is called the samples.

So this temporal causality means that I cannot take these samples and reorder them in any way. If we do that, we destroy the actual signal.

Resolution

The resolution of the signal is given by the number of samples per unit of time, normally per second. This is usually referred to as the sampling rate: how many samples we have per second.

"赫兹(Hz)是频率的单位,表示每秒钟的周期数。1 Hz 等于每秒钟 1 个周期。秒(s)是时间的基本单位。因此,赫兹和秒是相关的,频率的单位为赫兹是因为其表示每秒的周期数。

If we consider a stereo sound signal, then we have two of these curves here. So we have two tracks of samples. One from the left channel and one from the right channel.

You can represent this in different ways in the computer, but you could for instance think of it as a list where each entry is a 2-vector: a vector with two numbers, one corresponding to the oscillation in the left channel at that specific time and the other to the oscillation in the right channel.
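
A minimal sketch of what such a digitized stereo signal could look like in code, assuming a 44.1 kHz sampling rate and two synthetic sine tones (both choices are illustrative):

```python
import numpy as np

fs = 44100                               # sampling rate: samples per second (Hz)
duration = 1.0                           # seconds
t = np.arange(int(fs * duration)) / fs   # sample times

left = np.sin(2 * np.pi * 440 * t)       # 440 Hz tone in the left channel
right = np.sin(2 * np.pi * 660 * t)      # 660 Hz tone in the right channel

# Stereo signal: a list of 2-vectors, one (left, right) pair per sample.
stereo = np.stack([left, right], axis=1)  # shape (44100, 2)
```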

Image

image-20230208171445745

For an image, this is usually related to some spatially structured data.

As you know, an image can be some form of n-dimensional structure, and it is common to represent it in the computer as an n-dimensional array of values.

If the image is a 2D array, then we call each of these elements or locations in the image a pixel, which is short for picture element.

For a 3D image, we refer to an observed, measured value at a specific location as a voxel, which is short for volume element.

The idea is that in the 2D case a pixel is a measurement over a small area, while a voxel is a measurement in a very small volume.

The resolution is given by the number of pixels in the image array.

My understanding: a pixel is a point or location in the image, and the resolution is the number of pixels the image contains. More pixels means a finer granularity, a higher resolution, and a sharper image.

If we change the resolution of an image, it does have some consequences.

image-20230208172624990

The first one has the highest resolution; then we halve the resolution again and again, giving four images in total. => Harder to see. Recognition $\downarrow$.

Hence, for images, the resolution level says something about the minimal details that we can represent with a particular image.

So it is the same object, but we are using fewer pixels to represent it, and consequently we have less information and less detail about the specific object we are looking at.
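
A small sketch of the array view of an image and of what halving the resolution means in practice (the random image is only a stand-in):

```python
import numpy as np

# A grayscale image is a 2D array of intensity values (one per pixel).
img = np.random.randint(0, 256, size=(512, 512), dtype=np.uint8)

# Halving the resolution: keep every second pixel in each direction.
# The same scene is now described by 4x fewer samples, so fine details
# smaller than the new pixel spacing can no longer be represented.
half = img[::2, ::2]      # shape (256, 256)
quarter = img[::4, ::4]   # shape (128, 128)
```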

Optical image formation

Camera obscura / pinhole camera

image-20230208173539903

Light from the scene out in the real world travels through the pinhole and gets projected onto the image plane, where the final image is formed. => Perspective projection.

If we build the camera this way, the image we see is upside down, so we introduce a virtual image plane placed in front of the camera and work with the upright image on that plane instead.

The digital camera

But in a digital camera, we don't use film. We have a sensor, usually in the form of a chip, that reacts to the light that hits it.

image-20230208174332393

There are different technologies; the most common are the charge-coupled device (CCD) and the complementary metal-oxide-semiconductor (CMOS) chip placed on the image plane.

What is important to realize is that either type of chip is divided into an array of light-sensitive areas. Each of these boxes is a light-sensitive area that in the end corresponds to a measurement at a specific pixel in your digital image.

Light enters the camera and hits the sensor. The sensor measures the amount of light via the photoelectric effect and converts it into an electrical current; this current is then read out for each of the elements and digitized (quantized) into a value that we can represent in the computer.

If we want to construct a color camera, there are two main technologies. One is the Bayer mask; the more advanced alternative uses multiple chips. To construct a standard RGB color representation with the latter, you use three chips, for instance one with a coating that only allows light in the red part of the spectrum to pass, and then do some tricks with mirrors and so on inside the camera.

The most common approach in commercial cameras is the Bayer mask. The idea is that we take the sensor array and put a color coating on it, just as in the multi-chip case, which means that at every location on the chip we get a measurement of only one of the three color channels: a measurement for green here, one for red there, one for blue, another for green, and so on. To form a color image, we have to take these measurements and convert them into something that represents all three values at every location. There are different algorithms for this; one is the following.

If we want to know all three values at the central pixel, but we only have its green value, we take the neighbors' values to estimate what the red and blue channels should look like there. Whatever method we use, we end up with a 3-vector representing the RGB values for that pixel.
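
A hedged sketch of this neighbor-averaging idea, assuming an RGGB Bayer layout and simple bilinear interpolation (real cameras typically use more sophisticated demosaicing):

```python
import numpy as np
from scipy.ndimage import convolve

def demosaic_bilinear(raw):
    """Bilinear demosaicing of a raw sensor image with an RGGB Bayer mask.

    `raw` is a 2D array where each pixel holds only one color measurement:
        R G R G ...
        G B G B ...
    Missing channel values are filled in by averaging the neighbors.
    """
    h, w = raw.shape
    r = np.zeros((h, w))
    g = np.zeros((h, w))
    b = np.zeros((h, w))
    r[0::2, 0::2] = raw[0::2, 0::2]   # red samples
    g[0::2, 1::2] = raw[0::2, 1::2]   # green samples (two per 2x2 block)
    g[1::2, 0::2] = raw[1::2, 0::2]
    b[1::2, 1::2] = raw[1::2, 1::2]   # blue samples

    k_rb = np.array([[0.25, 0.5, 0.25],
                     [0.5,  1.0, 0.5 ],
                     [0.25, 0.5, 0.25]])   # interpolation kernel for R and B
    k_g = np.array([[0.0,  0.25, 0.0 ],
                    [0.25, 1.0,  0.25],
                    [0.0,  0.25, 0.0 ]])   # interpolation kernel for G

    # At measured pixels the kernels keep the original value; elsewhere
    # they average the nearest measured neighbors of that channel.
    return np.stack([convolve(r, k_rb), convolve(g, k_g), convolve(b, k_rb)], axis=-1)
```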


What do we represent in each pixel

image-20230208191314788

In general, a pixel or a voxel represents some measured physical quantity, and this could be the amount of light, i.e. the photon energy that hits the sensor at a specific location.

If we consider the point marked here, this is a color image, so it is represented with three values: a red, a green, and a blue value.

In general, a pixel or a voxel can contain a scalar value (for instance in a grayscale image), a triplet of color values such as RGB, or, in the hyperspectral case, a 224-dimensional vector that represents the measurements at that location.

More generally, it can represent whatever we measure at that location: scalars, vectors, matrices, and tensors, with 1, 3, 224, … values.
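
In code, these different per-pixel and per-voxel contents simply change the shape of the array; the sizes below are illustrative only:

```python
import numpy as np

h, w = 256, 256
gray = np.zeros((h, w))                 # 1 scalar per pixel (grayscale)
rgb = np.zeros((h, w, 3))               # 3-vector per pixel (color)
hyperspectral = np.zeros((h, w, 224))   # 224-vector per pixel (hyperspectral)
dti = np.zeros((64, 64, 64, 3, 3))      # 3x3 tensor per voxel (DTI-like data)
```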

Quantization of samples / pixels / voxels

In the sensor chip, we convert light into an electrical current and pass it through an analog-to-digital converter, which takes this electrical signal and digitizes it.

We take whatever we get out for a specific pixel and convert it into a number that we can represent with a fixed number of bits; the number of bits we use is what we refer to as the quantization level. Some common choices are shown below.

image-20230208192228458

Indexed color images are a clever way of compressing the image representation.

Effect of Quantization Level

This example uses 8 bits to represent the values in an X-ray image. The intensities represent the attenuation of the X-rays as they pass through the patient, as recorded on the opposite side of the X-ray source on a piece of film or a digital sensor.

image-20230208193014044

With 8 bits, it is clearly a grayscale image. With 1 bit, we can only represent two values at every pixel, either black or white; compared with the original image, almost all details are gone. We still have the outline of the skull, but we have lost a lot of detail.

In the two-bit case, every location can take four different values, which means there are only four distinct gray values visible in the image. More detail can be seen here than in the binary case.

As we lower the quantization level, we change the dynamic range of the image: the amount of intensity variation that can be represented is reduced. At one extreme, with 8 bits we can see a lot of detail; with 1 bit, we only get the rough outline of the object we are looking at.
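
A minimal sketch of re-quantizing an 8-bit grayscale image to a lower number of bits, illustrating how the number of representable gray levels shrinks (the random image is a placeholder):

```python
import numpy as np

def requantize(img, bits):
    """Reduce an 8-bit grayscale image to `bits` bits per pixel (sketch).

    With k bits only 2**k distinct gray levels remain, so the dynamic
    range of representable intensity variation shrinks accordingly.
    """
    levels = 2 ** bits
    step = 256 // levels
    quantized = (img // step) * step + step // 2   # map to bin centers
    return quantized.astype(np.uint8)

# Example: 8-bit image reduced to 2 bits (4 gray levels) and 1 bit (2 levels).
img = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
img_2bit = requantize(img, 2)
img_1bit = requantize(img, 1)
```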

Image coordinate systems

Images as arrays or functions.

How is a digital image formed and how do we represent it? We need to talk about an image as something we can apply an algorithm to, and we need some terminology for that.

We start by introducing image coordinate systems. There are basically two ways we can look at an image: we can think of it either as an array or as a function.

So, here I have an image.

image-20230208202218317
The origin is at the top for historical reasons: images were originally formed by pinhole imaging (perspective projection), and this convention was defined before the image was flipped back, which is why the origin sits at the top.

So, what are the coordinate systems? If we think of the image as an array, we can index pixels in this array: along the vertical axis we have rows, and along the horizontal axis we have columns.

Example => I want the intensity at a specific pixel, so I choose a row $r$ and a column $c$ and look it up in my array. This is one interpretation.

The other is to think of an image as a function of two variables, $x$ and $y$, with the standard convention that the horizontal axis is the x-axis and the vertical axis is the y-axis.

Example => What is the intensity or pixel value at the location $(x, y)$? The rows of the array correspond to $y$, and the columns correspond to $x$.

So when using this coordinate system, be sure to swap the order of the array's r and c, otherwise the result will look very strange.

Why do we need to represent it as a function? From the algorithm's perspective, we can then apply mathematical operations (e.g. derivatives). Additionally, when you think of an image as a function, you can visualize it in the traditional way, but also as a surface, where the vertical $z$-axis represents the intensity value at each pixel.
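
A small sketch of the two conventions, array indexing with (row, column) versus the function view $I(x, y)$ with swapped indices:

```python
import numpy as np

img = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)  # 480 rows, 640 columns

# Array view: index with (row r, column c).
r, c = 10, 25
value_array_view = img[r, c]

# Function view: I(x, y), with x along the horizontal axis (columns)
# and y along the vertical axis (rows), so the indices are swapped.
def I(x, y):
    return img[y, x]

value_function_view = I(25, 10)   # same pixel as img[10, 25]
assert value_array_view == value_function_view
```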

Color theory and color spaces

The representation of the color.

Visible light

In this lecture, I’ll be presenting how we represent colors and in general spectral measurements when we do images.

Visible light is just electromagnetic radiation that has energy in specific wavelengths. Normally, this range of visible light would be 390 nanometers to 750 nanometers.

Near-infrared covers wavelengths from around 700 to 1800 nanometers.

The visible range is what we can capture with an ordinary optical camera; we would capture X-rays if we used an X-ray machine or an X-ray-based CT scanner.

Perception

What does color really mean?

We can put a name on a color, but color is actually a psychological phenomenon: it is perception plus language conventions we have defined. We have an agreement in our culture that red looks like this, and so on. Still, there are a couple of key terms that we need.

When we talk about light as humans perceive it, it can be either chromatic or achromatic. Chromatic light means that we see the light as having color; achromatic light means that we perceive only intensity. You can think of any grayscale image as a representation of achromatic light.

Light intensity, the amount of light that you perceive either with your eye or with your camera, can be measured in different ways.

In the physical description, we use the term radiance, which is a measure of the radiated energy flux per solid angle. => The amount of energy that flows onto some unit area (e.g., a CCD chip).

Luminance is the perceptual description. For instance, if you buy a light bulb in the supermarket, you might see luminance described on the package. Luminance is measured in candela per square meter; a candela is roughly the amount of light that comes from a standard candle.

In contrast to radiance, luminance is the perceived light energy.

If you look at light, you can ask: how bright is this? Radiance is a physical measurement, but it does not relate directly to the way we perceive the amount of light that hits us. Brightness $\Leftrightarrow$ Luminance

We also tend to use the word intensity as a synonym for luminance and brightness.

To describe chromatic light (colored light), we need to use the notion of color space and this is linked to how we as humans perceive or sense color.

Trichromaticity

Humans perceive light using a trichromatic representation, which means we perceive light in three bands, and we use this signal to interpret it as a specific color.

Inside the retina of our eyes, we have two types of light-sensitive cells. There is a category of cells called cones that are sensitive to different ranges of colors, and we also have rods that are sensitive to light intensity; the rods are what you use when you see in darkness or at night.

The rods don’t measure color, that’s only the cones.

The cones come in three types: one type that is red-sensitive, one that is green-sensitive, and a third that is blue-sensitive.

If we look at the spectrum of light, the red-sensitive cell has a specific absorption curve, i.e. a curve that describes in which range of wavelengths it reacts to light. In terms of digitization, you can think of it as a sensor that can only see red light. The other two types behave in the same way but at different wavelengths.

Overlap

Light at 450 nanometers produces a reaction in the blue cells, but it also produces some reaction in the green cells and in the red cells. The curves represent how strongly each cell reacts at 450 nanometers: the red cell will not react much, because its curve is fairly small at 450. (The curves represent how well each cell absorbs energy at a given wavelength.) The blue cell, however, reacts quite strongly, because at 450 nanometers we are almost at the top of its curve, which peaks around 445 nanometers (the blue cell's peak sensitivity).

Remember the Bayer mask, where each sensor element has a color coating. Such a sensor also has an absorption curve; it is not the same as the curves of our eyes, but there is some curve that represents how the sensor reacts to different wavelengths.

The black curve here is a high-resolution representation of the spectrum of the light that hits the camera, for instance from the center of this galaxy. Each of the sensor curves is similar to the absorption curves we just saw for the human eye.

Sensor Model

image-20230209084839302

The black curve shows how our eyes or chips react to light at different wavelengths; in the previous case, we represented it with only 5 numbers.

If we introduce a mathematical model, what is actually going on here? Let's consider a sensor that has only one type of response, described by its absorption curve, producing a measurement $s$.

Say the sensor has a response curve that describes the wavelengths it reacts to. We now observe light with a spectrum $\phi$. The spectrum is a function of wavelength, just as the response curve is a function of wavelength.

What happens in the sensor is that we use this response curve to measure the amount of energy we actually receive from the light source. What we would like to get out of the sensor is a number: we take a snapshot, collect an image, and at a specific pixel we measure with this response curve and want a number out. How do we do that?

It is represented by an integral: we start by multiplying the two functions of wavelength, the response curve and the spectrum. This could look like the hand-drawn picture here.

The integral sums up the area under this product curve, so the gray area is the total amount of energy perceived across all wavelengths by the sensor with that response curve. This area under the curve is basically what we get out of the sensor.

So mathematically, we take the product between the response curve and the actual spectrum and then integrate over all wavelengths, and that gives us a number.
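
Written out (with the response curve denoted $f(\lambda)$ and the spectrum $\phi(\lambda)$; these symbol names are assumptions, since the slide's notation is not reproduced here), the sensor measurement is

\[s = \int f(\lambda)\,\phi(\lambda)\,\mathrm{d}\lambda\]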

Tristimulus

Now, how can I use this to understand how humans perceive light?

Humans perceive light with the three color-sensitive cell types, and this is called the tristimulus. That means we basically perform the integral operation three times: we have a response curve for the red cells, one for the green cells, and one for the blue cells. Then we get three values, three numbers, out of the integrals.

image-20230211195306004

Humans normally talk about color via primary colors, and this is a consequence of the tristimulus representation in the human visual system.

But we are measuring light, or perceiving color, with only three values. Importantly, with this three-value representation we have an additive model for forming new colors.
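
With the same assumed notation as above, the tristimulus amounts to three such integrals, one per cone response curve:

\[s_R = \int f_R(\lambda)\,\phi(\lambda)\,\mathrm{d}\lambda, \quad s_G = \int f_G(\lambda)\,\phi(\lambda)\,\mathrm{d}\lambda, \quad s_B = \int f_B(\lambda)\,\phi(\lambda)\,\mathrm{d}\lambda\]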

The tristimulus linear basis

Now we can expand a little bit more on this and say, if I want to represent colors in a computer, how can I do this?

Now, I know that whatever sensor I use has some response curve. I can perform an experiment and measure it, and then we know how to take a specific spectrum and convert it into a sensor measurement, at least in theory.

But I also want to be able to work with the color representation.

The nice thing is that if we take the primary colors, red, green, and blue, we can use them to form a linear basis.

image-20230211201335993

So it is a linear basis that represents colors; we refer to this basis as a color space, and it is three-dimensional because with the tristimulus we have three values.

How do I do this?

_I assume that I have a sensor and this sensor has three response curves_. Here is one response curve for the $i$-th type of cell, which could be the red, green, or blue one. Then we need a reference color spectrum: think of $\phi_j$ as one of three flashlights, each representing a primary color. So I have a red flashlight, a green flashlight, and a blue flashlight. I shine each of these flashlights onto my sensor and record the responses I get. If I do this, I get a collection of measurements: the red-sensitive sensor's response to the red, green, and blue flashlights; similarly, the green-sensitive sensor's response to the red, green, and blue flashlights; and so on.

Now we get a matrix and we can use this to represent any color based only on these primary colors.

With the measurements collected in this matrix of $p_{j}$ values, I can form three vectors. Consider the red flashlight: how do the red-, green-, and blue-sensitive sensors respond to it? From those three responses I form the vector $p_R$ for the red flashlight.

\[s = \alpha p_R + \beta p_G + \gamma p_B\]

And then, I can do the same thing for the green flashlight and blue flashlight.

Together, these three vectors form a 3D basis that spans the color space given by the three primary colors, red, green, and blue.

That means I can get any new color by taking a linear combination of these vectors with some constants $\alpha, \beta, \gamma$: I scale each vector and sum them up, and that can represent any color that the sensor with its three response curves can perceive.

My understanding: earlier we discussed how, for a single response curve, we measure its energy: we shine light with some spectrum onto the sensor, and the integral of the spectrum multiplied by the response curve gives the energy that curve picks up. Returning to the camera: the chip contains three response curves. We first shine the red flashlight on the chip and measure the energy each of the three response curves reports, giving a 3-vector. We do the same for the green and blue flashlights, and in this way we have constructed the color space that this chip (camera) can represent.
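
A tiny numerical sketch of this idea: the flashlight responses form the columns of a 3×3 basis matrix, and the coefficients $\alpha, \beta, \gamma$ for a new measurement follow from solving a linear system (all numbers below are made up for illustration):

```python
import numpy as np

# Columns are the sensor responses to the red, green, and blue flashlights
# (hypothetical numbers): p_R, p_G, p_B, each a 3-vector of readings from
# the red-, green-, and blue-sensitive sensors.
P = np.array([[0.90, 0.20, 0.05],
              [0.10, 0.80, 0.15],
              [0.05, 0.10, 0.90]])

# A new measurement s from the same sensor.
s = np.array([0.4, 0.5, 0.3])

# Find weights such that s = alpha * p_R + beta * p_G + gamma * p_B.
alpha, beta, gamma = np.linalg.solve(P, s)
```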

This construction relies on the primary color flashlights, which means I need to define standard flashlights in order to define the color space.

Metameric color stimuli

For the standard flashlights, this can be done with the CIE standard, which defines the primary colors using monochromatic light sources. Monochromatic means that there is energy at only one wavelength.

image-20230211211405051

So $\phi_R$, the red flashlight, has a spectrum that is $0$ everywhere except at exactly $700$ nanometers, where it is $1$ (in some unit of energy). The other two are defined the same way at their own wavelengths. This is the idealized, standard way of defining the primary color flashlights we need to form the space.

You already know the RGB representation: these flashlights define the basis vectors, and we can then form any RGB tristimulus representation of colors by changing the constants $\alpha, \beta, \gamma$.

For humans, this would lead to the basis vectors.

\[p_R = (1, 0, 0)^T, \ p_G = (0, 1, 0)^T, \ p_B = (0, 0, 1)^T\]
image-20230211211449286

The outcome is a three-component color vector, and sometimes we cannot tell two light spectra apart because we measure the light using only three values. There are different spectra that lead to exactly the same $s$ vector, which means we cannot perceive any difference between these colors; this is also true for human vision. => metameric color stimuli

The Red-Green-Blue (RGB) color space

Now we can represent color in an image with RGB color space.

It is a Cartesian coordinate system, and colors are added together inside this coordinate system.

The only constraint is that the color values should remain non-negative: with this definition it does not make sense to talk about a negative green value, or a negative blue value for that matter. So we stay in the positive octant of this coordinate system.

Once we have quantized, there is also an upper limit to the range of colors we can represent, as well as a granularity: the quantization level determines how many different colors we can represent.

If we use the standard true-color 24-bit representation, with 8 bits per channel, then we can start to interpret some colors in this space. At $(0, 0, 0)$ we have black, the origin of the color space. At the opposite corner, with the maximum value of 255 in all channels, we have white. Along the diagonal from the black point to the white point we have all the gray values, but as soon as we move away from that diagonal we start to get colors; moving closer to the red axis and along it, the color becomes more and more red.

_So when you do operations in this color space, remember that negative color values do not make sense, and that we have picked a quantization: there is also a maximum value that we can represent. Anything beyond that is an overflow that the computer cannot represent._
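
A small sketch of these constraints with 8 bits per channel; the clipping step is one common way (an assumption, not the lecture's prescription) to keep results inside the representable cube:

```python
import numpy as np

black = np.array([0, 0, 0], dtype=np.uint8)
white = np.array([255, 255, 255], dtype=np.uint8)
mid_gray = np.array([128, 128, 128], dtype=np.uint8)   # on the black-white diagonal

# Operations can push values outside the representable range;
# clip before converting back to 8 bits to avoid overflow or negative values.
pixel = np.array([250, 10, 40], dtype=np.int32)
brighter = np.clip(pixel + 60, 0, 255).astype(np.uint8)   # -> [255, 70, 100]
```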

RGB color space

Each channel can be viewed as a gray image representing the amount of colored light in that channel.

image-20230211214836948

We can take the image and slice off one of the channels, for instance the red channel. Now we have an image where each pixel has only one value, representing the amount of red light. It also means that we can visualize the individual channels on their own.

We notice that the red Lego brick appears very bright in the red channel, which means that a lot of red light was measured in those pixels. If you consider the green brick and the blue doll, you can see that they are rather dark in the red channel.

If we go to the green channel, the green brick is not as bright as the red one was in the red channel, but it is at least a light gray, which means there is some green in the light coming from the pixels on the green Lego brick.

It is almost the same shade of gray as the green brick, and this reveals that this camera has a severe overlap between the green and the blue channels.
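
Slicing off the channels is straightforward if the color image is stored as an array with the channels last (an assumption about the storage layout):

```python
import numpy as np

rgb = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)

# Each channel is itself a grayscale image: bright pixels indicate a lot
# of red (green, blue) light measured at that location.
red_channel = rgb[:, :, 0]
green_channel = rgb[:, :, 1]
blue_channel = rgb[:, :, 2]
```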

Converting RGB to gray scale

image-20230211215637993

We can do various operations on color spaces. For instance, with the RGB representation we can convert the image into a grayscale image. In the previous example, we just sliced off each channel and visualized it alone.

We can do this by realizing that the color space is just a linear representation. I can do things like adding colors together, so maybe I can find a mapping that takes an RGB vector and converts it into a single number representing the grayscale value.

We can use a projection of our pixel value ($R, G, B$) onto the color vector ($a, b, c$).

\[Gray = aR + bG + cB\]

This gives me a number, and if I pick $a, b, c$ cleverly, I can interpret the result as a grayscale image. I also want some constraints, for instance that $a+b+c = 1$.

We can pick any $a, b, c$ that respect this constraint. However, there are also standardizations of this choice.

Grayscale conversion formula
\[Gray = 0.299R + 0.587G + 0.114B\]

With the CIE weights, this yields a perceptually pleasant grayscale version of the color image.
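
A minimal sketch of this weighted-sum conversion using the standard luma weights, assuming an (H, W, 3) array in R, G, B order:

```python
import numpy as np

def rgb_to_gray(rgb):
    """Weighted sum of the channels, with coefficients summing to 1 (sketch).

    Uses the common luma weights 0.299, 0.587, 0.114; `rgb` is assumed
    to be an (H, W, 3) array in R, G, B channel order.
    """
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights   # projection onto (a, b, c)
    return np.round(gray).astype(np.uint8)
```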

Perceptual color space (HSV)

Hue-Saturation-Value

image-20230211221450025

There are several color spaces that we could talk about.

The perceptual color space remedies some of the problems that we actually have with the RGB space.

In RGB it is difficult to separate the intensity or brightness of the light from the actual color we perceive; these two things are intertwined. It is also difficult to cut up RGB space into nice regions that only represent, say, greenish colors or bluish colors.

Hue-Saturation-Value tries to fix this problem.

We make a transformation from RGB space into a new color space, HSV, by the definition above. It still has three values. Hue represents the color; you can think of it as almost a direct representation of wavelength in the visible spectrum. Hue is an angle around a circle where we get different colors as we travel around it; at some point we return to the starting point and it starts over again.

Saturation is the second number that represents the purity of the color or how much white has been mixed into the color.

Value represents the luminance or the brightness of the color.

For a low value, we get dark colors. Also, if you move along the saturation axis, you can go from a very pure color to white in the center.

How do we do this? It is a non-linear transformation of RGB, whereas the conversion from RGB to grayscale was linear.

HSV color space

Each channel is an image.

image-20230211222743780

Why do the background and the blackboard behind get noisy? Because there is numerical noise in the saturation channel: the original RGB values at these locations are very close to zero, so the transformation produces garbage in the hue and saturation channels.

So, how do we do the transform?

RGB to HSV or HSI

image-20230211223512956
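
As a sketch of the transform on a single pixel, Python's standard-library `colorsys` module implements an RGB-to-HSV conversion (it works on floats in [0, 1]; the exact formulas may differ slightly from the slide's HSI variant):

```python
import colorsys

# colorsys works on floats in [0, 1]; scale 8-bit values accordingly.
r, g, b = 30 / 255, 200 / 255, 120 / 255
h, s, v = colorsys.rgb_to_hsv(r, g, b)

# h is the hue as a fraction of a full turn around the color circle,
# s the saturation (purity), v the value (brightness = max of R, G, B).
print(h * 360, s, v)   # hue in degrees, saturation and value in [0, 1]
```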

Summary