Image extraction on the deep web has special characteristics. Images on different deep web sites differ in position, size, and meaning, and result pages also contain many noise images, such as advertisement images and useless images inside data record nodes (for example, in nodes named “Buy” or “collection”).
We can therefore first locate the data region node and then use clustering to extract the images we want. Some deep web sites, however, return a simple result page and keep the images of the goods in the detailed data record pages. For these sites, we go into the detail pages and extract the images with a method similar to breadth-first search. The surface web also holds many images and texts about the data records found on the deep web; extracting these images and texts for customers helps them understand the goods better. Unlike deep web pages, however, surface web pages are more complex. We obtain the surface web pages from search engines and estimate whether each page is related to the data record.
Finally, we extract the images from these pages with a method similar to breadth-first search. The extractor is divided into two parts. First, we extract the images of the result data from the deep web; this part is called the image extractor of deep web. Second, if users want to know more about the merchandise, we get further images from the surface web; this part is called the image extractor of surface web.
The extractor extracts images in layers to satisfy users’ needs at different depths. It thus follows the pay-as-you-go idea, and is more flexible and more widely applicable. Figure 1 shows the main process of AIE. The details are addressed in the following sections.
Getting data region node
By surveying result pages returned from the deep web, we find that data records are usually located at the center of the page, and that the page is often surrounded by noise images such as advertisement images and website introduction images. To process fewer images and make the result more accurate, we first find the data region node and then extract images only within it. In this way we remove many noise images and obtain a better result.
We first convert the deep web result page into a DOM tree, and then use both top-to-bottom and bottom-to-top searches to find the data region node. After getting the key words sent by the user, we start from the node named body (which we treat as the root node of the DOM tree) and count the key words under each child of body. We sort these counts in descending order, choose the child with the largest count (provided the count is larger than the threshold), and repeat the procedure on that child until no child has a key word count larger than the threshold (in this extractor, the threshold is set to 5). On some deep web sites, however, nodes inside the data region node themselves contain more key words than the threshold.
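The top-to-bottom descent can be sketched as follows. This is a minimal illustration, not the extractor's actual code: it assumes the page is available as well-formed markup that Python's `xml.etree.ElementTree` can parse, counts key words case-insensitively as substring matches, and uses the hypothetical names `keyword_count` and `descend_by_keyword_count`.

```python
import xml.etree.ElementTree as ET

THRESHOLD = 5  # threshold value stated in the text


def keyword_count(node, keywords):
    """Count key word occurrences in all text under `node` (assumed: substring match)."""
    text = "".join(node.itertext()).lower()
    return sum(text.count(k.lower()) for k in keywords)


def descend_by_keyword_count(body, keywords, threshold=THRESHOLD):
    """Follow the child with the most key words while its count exceeds the threshold."""
    node = body
    while True:
        children = list(node)
        if not children:
            return node
        best = max(children, key=lambda c: keyword_count(c, keywords))
        if keyword_count(best, keywords) <= threshold:
            return node  # no child exceeds the threshold: stop here
        node = best


# Hypothetical result page: one advertisement and eight short records.
page = ET.fromstring(
    "<body>"
    "<div id='ad'>Linux sale</div>"
    "<div id='main'><ul id='results'>"
    + "".join(f"<li>Linux book {i}</li>" for i in range(8))
    + "</ul></div>"
    "</body>"
)
region = descend_by_keyword_count(page, ["Linux"])
print(region.get("id"))  # results
```

In this toy page each record holds a single key word, so the descent stops at the list that contains the records. A record holding more key words than the threshold would pull the descent one level too deep.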
Figure 2 shows one data record returned by searching for “Linux” on the website named “Dangdang”. From this figure we can see that a single data record node contains many key words (printed in orange), which can make the above method return an incorrect data region node. To get the correct data region node, we reverse the method. Instead of the number of key words a node has, we use the number of its child nodes that contain key words. We call these child nodes key-child-nodes. When the number of one node’s key-child-nodes is smaller than the threshold, we can get this node as a data region node.
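To make the difference between the two measures concrete, the sketch below computes both for a single record and for a list of records. It illustrates only the key-child-node count itself and does not attempt the full stopping rule. The markup, the substring matching, and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def keyword_count(node, keywords):
    """Total key word occurrences in all text under `node`."""
    text = "".join(node.itertext()).lower()
    return sum(text.count(k.lower()) for k in keywords)


def key_child_count(node, keywords):
    """Number of direct children whose subtree contains at least one key word."""
    count = 0
    for child in node:
        text = "".join(child.itertext()).lower()
        if any(k.lower() in text for k in keywords):
            count += 1
    return count


# Hypothetical record with many key words spread over only two children.
RECORD = (
    "<li><a>Linux Linux Linux</a>"
    "<p>Linux Linux Linux Linux</p>"
    "<span>Buy</span></li>"
)
record = ET.fromstring(RECORD)
region = ET.fromstring("<ul>" + RECORD * 6 + "</ul>")

print(keyword_count(record, ["Linux"]), key_child_count(record, ["Linux"]))  # 7 2
print(keyword_count(region, ["Linux"]), key_child_count(region, ["Linux"]))  # 42 6
```

The record exceeds a key word threshold of 5 on its own, yet it has only two key-child-nodes, while the list that holds the records has six.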
Extracting data record images
After we get the data region node, we can extract all the images in it by using HTML tags. The image tag in HTML is “<img>”. Image extraction is divided into two parts: deleting noise images and extracting data record images. After we get all the images in the data region node, we use agglomerative hierarchical clustering to extract the data record images.
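Collecting the images of a data region node might look like the sketch below. It assumes well-formed markup and that each image carries integer `width` and `height` attributes; images without them are skipped here, whereas a real extractor would have to obtain the size some other way. The name `extract_images` is hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_images(region):
    """Return (src, width, height) for every <img> element under `region`."""
    images = []
    for img in region.iter("img"):
        width, height = img.get("width"), img.get("height")
        if width is None or height is None:
            continue  # size not declared in the markup (assumption: skip)
        images.append((img.get("src"), int(width), int(height)))
    return images


# Hypothetical data region with a cover image, a button image, and an unsized image.
region = ET.fromstring(
    "<ul>"
    "<li><img src='cover1.jpg' width='120' height='150'/></li>"
    "<li><img src='buy.gif' width='40' height='16'/></li>"
    "<li><img src='unsized.jpg'/></li>"
    "</ul>"
)
images = extract_images(region)
print(images)  # [('cover1.jpg', 120, 150), ('buy.gif', 40, 16)]
```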
First we get the width and height of every image and mark them as X and Y, so each image can be denoted by a pair of numbers (X, Y). We then treat each image as a cluster and merge these clusters into larger clusters. The merging rule is that the difference in width and the difference in height between two clusters are both smaller than 10.
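A minimal sketch of this merging step is given below. The text does not say how the distance between two multi-image clusters is measured, so the sketch assumes single linkage: two clusters are merged when any pair of their members differs by less than 10 in both width and height. The names `close` and `agglomerate` are hypothetical.

```python
def close(a, b, limit=10):
    """True when two (width, height) pairs differ by less than `limit` in both dimensions."""
    return abs(a[0] - b[0]) < limit and abs(a[1] - b[1]) < limit


def agglomerate(sizes, limit=10):
    """Merge single-image clusters until no two clusters satisfy the merging rule."""
    clusters = [[s] for s in sizes]  # every image starts as its own cluster
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumption): one close pair is enough to merge.
                if any(close(a, b, limit) for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters


# Hypothetical image sizes: three covers, two button images, one banner.
sizes = [(120, 150), (122, 148), (40, 16), (118, 155), (468, 60), (42, 16)]
clusters = agglomerate(sizes)
for cluster in clusters:
    print(cluster)
```

With these sizes the three cover-like images end up in one cluster, the two small button images in another, and the banner stays alone.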