Fashion representation. Thus far we have shown the styles discovered by our approach as well as our ability to forecast the popularity of visual styles in the future. Next we examine the impact of our representation compared to both textual meta-data and CNN-based alternatives.

Meta Information. Fashion items are often accompanied by information other than the images. We consider two types of meta information supplied with the Amazon datasets (Fig. 2): 1) Tags, which identify the category, the age range, the trademark, the event, etc.; 2) Text, which provides a description of the item in natural language. For both, we learn a vocabulary of tags and words across the dataset and represent each item with a bag-of-words representation. Thereafter, we can employ our NMF and forecasting models just as we do with our visual attribute-based vocabulary. In the results, we consider a text-only baseline as well as a multi-modal approach that augments our attribute model with textual cues.

Visual Attributes are attractive in this problem setting for their interpretability, but how fully do they capture the visual content? To analyze this, we implement an alternative representation based on deep features extracted from a pre-trained convolutional neural network (CNN). In particular, we train a CNN with an AlexNet-like architecture on the DeepFashion dataset to perform clothing classification (see Supp. for details). Since fashion elements can be local properties (e.g., v-neck) or global (e.g., a-line), we use the CNN to extract representations at two abstraction levels: 1) FC7, features extracted from the last hidden layer; 2) M3, features extracted from the third max-pooling layer after the last convolutional layer. We refer to these as ClothingNet-FC7 and ClothingNet-M3 in the following.
Forecasting results The textual and visual cues inherently rely on distinct vocabularies, and the metrics applied for Table 2 are not comparable across representations.
Nonetheless, we can gauge their relative success in forecasting by measuring the distribution difference between their predictions and the ground truth styles, in their respective feature spaces. In particular, we apply the experimental setup of Sec. 4.2, then record the Kullback-Leibler divergence (KL) between the forecasted distribution and the actual test set distribution. For all models, we apply our best performing forecaster from Table 2 (EXP). Table 3 shows the effect of each representation on forecasting across all three datasets. Among all single-modality methods, ours is the best. Compared to the ClothingNet CNN baselines, our attribute styles are much more reliable. Upon visual inspection of the styles learned from the CNNs, we find that they are sensitive to the pose and spatial configuration of the item and the person in the image. This reduces the quality of the discovered styles and introduces more noise in their trajectories. Compared to the tags alone, the textual description is better, likely because it captures more details about the appearance of the item. However, compared to any baseline based only on meta-data, our approach is best. This is an important finding: predicted visual attributes yield more reliable fashion forecasting than strong real-world meta-data cues. To see the future of fashion, it pays off to really look at the images themselves. The bottom of Table 3 shows the results when using various combinations of text and tags along with attributes. We see that our model is even stronger, arguing for including meta-data with visual data whenever it is available.
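The KL-based comparison above can be sketched as follows; the two distributions are illustrative placeholders, not actual results from Table 3:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over styles."""
    p = np.asarray(p, dtype=float) + eps   # smooth to avoid log(0)
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Illustrative style-popularity distributions (not real results).
actual   = [0.40, 0.35, 0.25]
forecast = [0.38, 0.36, 0.26]
divergence = kl_divergence(actual, forecast)  # small value -> close forecast
```

Because each representation induces its own style space, the KL score is computed within each feature space separately, which is what makes the cross-representation comparison meaningful.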
Style discovery. We use our deep model trained on DeepFashion (cf. Sec. 3.1) to infer the semantic attributes for all items in the three datasets, and then learn K = 30 styles from each. We found that learning around 30 styles within each category is sufficient to discover interesting visual styles that are neither too generic, with large within-style variance, nor too specific, i.e., describing only a few items in our data. Our attribute predictions average 83% AUC on a held-out DeepFashion validation set; attribute ground truth is unavailable for the Amazon datasets themselves. Fig. 3 shows 15 of the discovered styles in 2 of the datasets, along with the 3 top-ranked items based on the likelihood of the style given the item, p(sk|xi), and the most likely attributes per style, p(am|sk). As anticipated, our model automatically finds the fine-grained styles within each genre of clothing. While some styles vary across certain dimensions, a certain set of attributes identifies each style's signature. For example, color is not a significant factor in the 1st and 3rd styles (indexed from left to right) of Dresses. It is the mixture of shape, design, and structure that defines these styles (sheath, sleeveless, and bodycon in the 1st; chiffon, maxi, and pleated in the 3rd). On the other hand, the clothing material may dominate certain styles, like leather and denim in the 11th and 15th styles of Dresses. Having a Dirichlet prior for the style distribution over the attributes induces sparsity; hence, our model focuses on the most distinctive attributes for each style. A naive approach (e.g., clustering) could be distracted by the many visual factors and become biased towards certain properties like color, e.g., by grouping all black clothes in one style while ignoring subtle differences in shape and material.
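As a rough sketch of the style discovery step, plain NMF from scikit-learn stands in here for our model (which additionally places a Dirichlet prior on the style-attribute distributions to induce sparsity); the item-attribute matrix is random placeholder data, not real attribute predictions:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Placeholder item-by-attribute matrix: rows are items, columns are
# predicted attribute confidences in [0, 1].
A = rng.random((200, 50))

K = 30  # number of styles per category, as in the text
model = NMF(n_components=K, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(A)   # item-to-style affinities, akin to p(sk|xi)
H = model.components_        # style-to-attribute weights, akin to p(am|sk)

# Most distinctive attributes of the first discovered style
top_attrs = np.argsort(-H[0])[:5]
```

Ranking items by their column of W and attributes by their row of H yields the per-style item and attribute rankings shown in Fig. 3.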
Style forecasting. Having discovered the latent styles in our datasets, we construct their temporal trajectories as in Sec. 3.3 using a temporal resolution of months. We compare our approach to several well-established forecasting baselines, which we group in three main categories.

Naïve: These methods rely on general properties of the trajectory: 1) mean: forecasts the future values to be equal to the mean of the observed series; 2) last: assumes the forecast to be equal to the last observed value; 3) drift: extrapolates the general trend of the series.

Autoregression: These are linear regressors based on the last few observed values, the "lags". We consider several variations: 1) the linear autoregression model (AR); 2) the AR model that accounts for seasonality (AR+S); 3) the vector autoregression (VAR), which considers the correlations between the different styles' trajectories; 4) the autoregressive integrated moving average model (ARIMA).

Neural Networks: Similar to autoregression, the neural models rely on the previous lags to predict the future; however, these models incorporate nonlinearity, which makes them more suitable for modeling complex time series. We consider two architectures with sigmoid nonlinearity: 1) the feed-forward neural network (FFNN); 2) the time-lagged neural network (TLNN).

For models that require stationarity (e.g., AR), we consider the differencing order as a hyperparameter for each style. All hyperparameters (α for ours, the number of lags for the autoregression models, and the number of hidden neurons for the neural networks) are estimated over the validation split of the dataset. We compare the models based on two metrics: the mean absolute error $\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |e_t|$ and the mean absolute percentage error $\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n} \left|\frac{e_t}{y_t}\right| \times 100$, where $e_t = \hat{y}_t - y_t$ is the error in predicting $y_t$ with $\hat{y}_t$.

Forecasting results. Table 2 shows the forecasting performance of all models on the test data.
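The three naïve baselines and the two error metrics can be sketched as follows; the trajectory below is illustrative, not a real style trajectory:

```python
import numpy as np

def naive_forecasts(series, horizon):
    """The three naive baselines: mean, last, and drift."""
    series = np.asarray(series, dtype=float)
    mean = np.full(horizon, series.mean())
    last = np.full(horizon, series[-1])
    slope = (series[-1] - series[0]) / (len(series) - 1)
    drift = series[-1] + slope * np.arange(1, horizon + 1)
    return mean, last, drift

def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_hat) - np.asarray(y))))

def mape(y, y_hat):
    """Mean absolute percentage error."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs((y_hat - y) / y)) * 100)

trajectory = [1.0, 2.0, 3.0, 4.0]          # illustrative style popularity
m, l, d = naive_forecasts(trajectory, horizon=2)
# drift extrapolates the overall trend of the observed series
```

Note that MAPE is undefined when an actual value $y_t$ is zero, which is one reason both metrics are reported.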
Here, all models use the identical visual style representation, namely our attribute-based NMF approach. Our exponential smoothing model outperforms all baselines across the three datasets. Interestingly, the more involved models like ARIMA and the neural networks do not perform better. This may be due to their larger number of parameters and the relatively short style trajectories. Additionally, no strong correlations among the styles were detected, and VAR showed inferior performance. We expect there would be stronger influence between styles from different garment categories than between styles within a category. Furthermore, modeling seasonality (AR+S) does not improve the performance of the linear autoregression model. We notice that the Dresses dataset is more challenging than the other two. The styles there exhibit more temporal variation compared to those in Tops&Tees and Shirts, which may explain the larger forecast error in general. Nonetheless, our model generates a reliable forecast of the popularity of the styles for a year ahead across all datasets: the style trajectories forecasted by our approach are within a close range of the actual ones (only 3 to 6 percent error in terms of MAPE). Furthermore, we notice that our model is not very sensitive to the number of styles. When varying K between 15 and 85, the relative performance of the forecast approaches is similar to Table 2, with EXP performing best. Fig. 4 visualizes our model's predictions on four styles from the Tops&Tees dataset. For the trajectories in Fig. 4a and Fig. 4b, our approach successfully captures the popularity of the styles in year 2013. The styles in Fig. 4c and Fig. 4d are much more challenging: both experience an inflection point at year 2012, from declining popularity to an increase and vice versa. Still, our model forecasts this change in direction correctly, and the error in the estimated popularity is minor.
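A minimal sketch of exponential smoothing, the family our EXP forecaster belongs to, assuming the simple (constant-forecast) variant; in our experiments α is estimated on the validation split, whereas here it is fixed by hand for illustration:

```python
def exp_smoothing_forecast(series, alpha):
    """Simple exponential smoothing: recursively update the smoothed
    level and hold it constant over the forecast horizon."""
    level = series[0]
    for y in series[1:]:
        # Each new observation is blended with the running level.
        level = alpha * y + (1 - alpha) * level
    return level

# Illustrative monthly popularity values of one style (not real data)
series = [0.20, 0.25, 0.30, 0.28]
forecast = exp_smoothing_forecast(series, alpha=0.5)
```

A larger α weights recent observations more heavily, which suits trajectories whose popularity shifts quickly; a smaller α averages over a longer history.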