In recent years, several studies have applied computer vision techniques to the fashion industry. These include algorithms that search for similar clothes using images [1, 2] or images and text [3, 4], and algorithms that generate images of people in arbitrary poses from images and posture data [5, 6]. We propose an improved fashion style recognition algorithm that fashion designers can use to check whether their products match target fashion styles. It also allows online shoppers to search for fashion items by style keyword without requiring manual annotation. Another application is the automated tracking of fashion trends using online and/or real-world images.

There have been several studies on fashion style recognition. Takagi et al. created the FashionStyle14 dataset (Fig. 1), which contains 13,126 images of 14 fashion styles: conservative, dressy, ethnic, fairy, feminine, gal, girlish, casual, lolita, mode, natural, retro, rock, and street. The styles were annotated by fashion experts. They also conducted comparative classification experiments with several deep learning architectures, including VGG, Inception-v3, ResNet50, and Xception; ResNet50 achieved the highest accuracy. Kiapour et al. created the HipsterWars dataset (Fig. 2), which contains 1,893 images of five fashion styles: bohemian, goth, hipster, pinup, and preppy. Each image has a rating score from 0 to 50 indicating how strongly it reflects the fashion style; the classifications and scores were determined by users' votes. They also developed a classification algorithm using handcrafted feature vectors and a support vector machine (SVM) classifier. Simo-Serra et al. proposed StyleNet, which uses a convolutional neural network (CNN) and weakly supervised learning.
Although its feature dimension is smaller than those of other methods, this algorithm achieved high classification accuracy on the HipsterWars dataset. Nakajima et al. proposed a method that extracts feature vectors with a pre-trained ResNet50 after extracting the human area (the area a person occupies) at the pixel level. They used the single shot multibox detector (SSD) and the pyramid scene parsing network (PSPNet) to extract human areas, and an SVM for classification. The study reported an accuracy improvement on the HipsterWars dataset of approximately 1% with SSD and approximately 2% with PSPNet.

All previous methods used a unified recognition approach that outputs a fashion style from a single input image. In contrast, our method utilizes multiple pre-trained fashion style CNNs that take images of different body parts extracted from an input image. This was motivated by several successful approaches to subcategory recognition for birds [17, 18]. In addition, based on our findings on how shape information relates to fashion styles, we introduce a recognition approach that also takes human posture and grayscale images as input. Specifically, we prepare six pre-trained CNNs as feature extractors that take different body parts (whole body, clothes, head, and limbs), human posture, and whole-body grayscale images, and we concatenate their output feature vectors. We call this feature extraction method "component-dependent CNNs (CD-CNNs)". In our experiments, we compared classification accuracy with existing methods on the HipsterWars and FashionStyle14 datasets.
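The pipeline described above can be sketched in code. The following is a minimal, hypothetical illustration of the concatenation step only: the per-component extractors are random-projection stand-ins, not the actual pre-trained ResNet50 models, and all names and sizes here are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Sketch of CD-CNN feature extraction: six component images are each
# passed through a component-specific feature extractor, and the
# resulting vectors are concatenated into one descriptor.
COMPONENTS = ["whole_body", "clothes", "head", "limbs", "posture", "grayscale"]
FEAT_DIM = 2048  # dimension of a global-pooled ResNet50 feature vector

def extract_features(image: np.ndarray, component: str) -> np.ndarray:
    """Placeholder for a component-specific pre-trained CNN extractor.

    In the actual method this would run a ResNet50 pre-trained for the
    given component; here a fixed random projection stands in for it.
    """
    flat = image.ravel()
    rng = np.random.default_rng(COMPONENTS.index(component))
    proj = rng.standard_normal((FEAT_DIM, flat.size))
    return proj @ flat

def cd_cnn_features(component_images: dict) -> np.ndarray:
    """Concatenate per-component feature vectors into one descriptor."""
    return np.concatenate(
        [extract_features(component_images[c], c) for c in COMPONENTS]
    )

toy_image = np.random.default_rng(0).standard_normal((8, 8))
component_images = {c: toy_image for c in COMPONENTS}  # toy inputs
features = cd_cnn_features(component_images)
print(features.shape)  # 6 components x 2048 dims = (12288,)
```

In the real method each component image (e.g. the clothes crop or the whole-body grayscale image) would differ, and the concatenated vector would then be fed to a classifier.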
This paper demonstrates how effectively our method improves fashion style classification accuracy on the HipsterWars and FashionStyle14 datasets. Our method achieved 85.3% classification accuracy on the HipsterWars dataset, a 4.4% improvement over existing methods, and 77.7% on the FashionStyle14 dataset, a 5.7% improvement. Our CD-CNNs create six component images from an input image, extract a feature vector from each component image with a pre-trained ResNet50, and concatenate these feature vectors. We used an SVM as the classifier and found its best parameters by 5-fold cross-validation. The best feature combination was found by evaluating all 63 combinations. For the HipsterWars dataset, the best combination consisted of four components (whole body, clothes, head, and grayscale), whereas for the FashionStyle14 dataset it consisted of all six components (whole body, clothes, head, limbs, posture, and grayscale). These results suggest that the whole body, clothes, head, and grayscale components are commonly important for fashion style classification, while the importance of the limbs and posture components depends on the dataset, since each dataset covers different fashion styles. In the experiment on the FashionStyle14 dataset, we confirmed that posture and limbs (including shoes) helped improve accuracy. In future work, we will study the key components of each fashion style and utilize them to further improve classification accuracy or to recommend fashion items according to customers' favorite fashion styles in online shopping.
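The feature-combination search described above can be sketched as follows. This is a hedged illustration using scikit-learn on synthetic stand-in features, not the authors' code: the grid of SVM parameters, the toy data sizes, and the feature values are all assumptions made for the example. The only fixed quantity is the number of non-empty subsets of six components, 2^6 - 1 = 63.

```python
import itertools

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# For each of the 63 non-empty component subsets: concatenate the
# chosen per-component features, tune an SVM by 5-fold cross-validation,
# and keep the subset with the best cross-validated accuracy.
COMPONENTS = ["whole_body", "clothes", "head", "limbs", "posture", "grayscale"]
rng = np.random.default_rng(0)

n_samples, dim = 60, 16  # toy sizes; real CD-CNN features are far larger
y = rng.integers(0, 2, size=n_samples)  # toy binary style labels
# Synthetic features; odd-indexed components get a weak label signal.
feats = {c: rng.standard_normal((n_samples, dim)) + y[:, None] * (i % 2)
         for i, c in enumerate(COMPONENTS)}

subsets = [s for r in range(1, len(COMPONENTS) + 1)
           for s in itertools.combinations(COMPONENTS, r)]
assert len(subsets) == 63  # all non-empty combinations

best_score, best_subset = -1.0, None
for subset in subsets:
    X = np.hstack([feats[c] for c in subset])
    # Small example grid over the SVM regularization parameter C.
    search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1.0]}, cv=5)
    search.fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_subset = search.best_score_, subset

print(best_subset, round(best_score, 3))
```

On the synthetic data the selected subset is meaningless; the point is the exhaustive loop over all 63 combinations with a cross-validated score per subset, which is feasible here because the number of components is small.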