Valence/Arousal Estimation of Occluded Faces from VR Headsets

Abstract—Emotion recognition from facial visual signals is a challenge which has attracted enormous interest over the past two decades. Researchers are attempting to teach computers to better understand a person's emotional state, since providing emotion recognition can massively enrich experiences; the benefits of this research for human–computer interaction are far-reaching. Emotions are intricate, so we need a representative model of the full spectrum displayed by humans. A multi-dimensional emotion representation, which includes valence (how positive or negative an emotion is) and arousal (how calming or exciting an emotion is), is a good fit. Virtual Reality (VR), a fully immersive computer-generated world, has witnessed significant growth over the past years. It has a wide range of applications, including in mental health, such as exposure therapy and the self-attachment technique. In this paper, we address the problem of emotion recognition when the user is immersed in VR. Understanding emotions from facial cues is in itself a demanding task; it is made even harder when a head-mounted VR headset is worn, as an occlusion then blocks the upper half of the face. We attempt to overcome this issue by introducing EmoFAN-VR, a deep neural network architecture that analyses facial affect in the presence of a severe occlusion from a VR headset with a high level of accuracy. We simulate an occlusion representing a VR headset and apply it to all datasets in this work. EmoFAN-VR predicts both discrete and continuous emotions in one step, meaning it can be used in real-time deployment. We fine-tune our network on the AffectNet dataset under VR occlusion and test it on the AFEW-VA dataset, setting a new baseline for this dataset under VR occlusion.


I. INTRODUCTION
Affective computing, the study of machines recognising human emotions [1], is a field of research that aims to enable intelligent systems to recognise, infer and interpret human emotions. It is an interdisciplinary field, bringing together researchers from computer science, psychology, and cognitive science. Facial Expression Recognition (FER), a branch of affective computing, is a research discipline that began several decades ago. Facial expression conveys around 55% of the total information in communication [2].
Today, hundreds of companies are working on emotion-decoding technology, in an effort to teach computers how to predict human behaviour. This technology has many uses, such as in the healthcare industry [3] and in games for children [4]. Movie companies are using the technology to test volunteers' reactions to films [5]. Car companies, such as BMW, Ford and Kia Motors, aim to use it to assess driver alertness [6]. In addition, it is used by companies to measure users' emotions and hence their satisfaction with the quality of a service provided [7]. Over the past two decades automatic emotion recognition has made significant progress. One of the largest challenges besetting automatic facial emotion recognition algorithms is that the developed frameworks have only been trained on data collected in controlled laboratory settings with frontal faces, perfect illumination and posed expressions [8]. These algorithms are then applied to images and videos from the internet which have been captured in unconstrained environments and, unsurprisingly, they do not perform as well. To deduce the affective state of a person captured under real-world conditions, we require methods which can execute emotion analysis 'in-the-wild'. The term 'in-the-wild' implies a variety of head poses, illumination conditions, occlusions, different environments/scenes and background noise [8].
Deep learning techniques require a multitude of data to train on. In the past decade a large number of publicly available databases containing enormous numbers of affective images collected from the real world, such as AffectNet [9], ExpW [10] and Aff-Wild2 [11]–[17], have become available. These are invaluable to the introduction and progress of deep neural networks for FER. A vast array of research has been conducted into FER using discrete emotions. The six universal basic emotions (happiness, fear, anger, disgust, surprise and sadness) were introduced by Ekman [18]. However, it is conjectured that they are limited in how they represent the vast range of emotions which humans are able to display [19].
Today, researchers are increasingly focusing on a multi-dimensional emotion representation built upon the circumplex model proposed by Russell [20]. This model has more granularity and shows more expressivity when representing emotional displays than the discrete model. The dimensions investigated are valence, which describes how positive or negative an emotion is, and arousal, which describes how active or calm an emotion is. For valence-arousal recognition, the optimisation becomes individual or combined regression of these two descriptors.
This paper addresses the problem of FER when the user is immersed in a virtual environment wearing a headset. Most researchers work on emotion recognition from the user's entire facial expression. Few have analysed emotion recognition with occlusions, and even fewer have analysed occlusions applied around the eye region, intended to represent a virtual reality (VR) headset.
In the past decade, VR has increasingly emerged as a machine-generated environment with scenery and objects that are highly realistic, ensuring the user feels as if they are submerged in their fictional surroundings. This experience is perceived through a VR headset, a piece of equipment which can additionally be used with hand motion controllers. Many industries have already taken advantage of this new technology, such as medicine, culture, education and architecture. From the dissection of the human body to guided museum visits, VR allows us to venture to places that would otherwise be unimaginable [21]. FER in the context of VR now has a wide range of applications. We will next review applications of VR in mental health, where emotion recognition platforms can play a key role in monitoring and understanding the patient's emotional experience. In fact, there is now a consensus that mental illness is associated with problems in self-regulation of emotions, and in many therapies treatment is focused on the patient acquiring an enhanced capacity for emotion self-regulation [22].

A. Application of VR in Exposure Therapy
There is a vast amount of evidence that virtual reality can be used to treat people with a variety of mental disorders. These range from eating disorders and substance-related addiction to depression and post-traumatic stress disorder (PTSD) [23]–[25]. VR has been shown to help engage patients in their treatment [26]. Exposure therapy is a psychological therapy developed to aid people in overcoming their fears. Throughout this intervention, users are gradually and repeatedly exposed to the situations and objects that they fear, in a safe environment. This therapy has been shown to reduce mental suffering for individuals who experience incessant PTSD and anxiety disorders [27]–[29].
VR exposure therapy (VRET), an enhanced version of exposure therapy, has recently emerged as a very effective treatment for PTSD, social anxiety disorder, obsessive-compulsive disorder and various phobias [30]. VRET allows a more personalised, regulated and engaging exposure to fearful situations. It is regularly preferred by therapists and individuals over imaginal exposure, where the patient must imagine the feared situations in their own mind, and over in vivo exposure, where the patient must face the feared situations in real life [27]. In particular, in vivo exposures are extremely difficult to arrange due to safety policies and they take a lot of time to set up. There are several ways in which VRET can help the user. Firstly, as time passes, the user will find that their mental associations with the circumstances they fear weaken.
Secondly, as the user continues to receive this therapy, they can learn that they are able to deal with their feelings and overcome their fears through repeated exposure to difficult situations. Finally, through repeated exposure users can begin to attach more beneficial associations to feared circumstances [30]. One of the largest benefits of VR is that users are aware that a simulated situation is not real, yet they react as if it is. Users find it easier to try out and practise their new remedial techniques for difficult scenarios in a virtual environment rather than in reality. The user can then take these lessons practised in VR and implement them in the real world [31]. Therefore, one of the main and fundamental applications of our work is in the use of VR for psychotherapy, in particular exposure therapy. VRET is beneficial for both the therapist and the patient. The therapist can personalise the content and view exactly what the patient is experiencing, and so talk them through how to react to each specific situation [26]. In addition, therapists can observe whether patients are paying attention to the simulation. Here again, emotion recognition can help the therapist to track how the patient reacts to different parts of the simulation, which will help to improve the experience in future. Lindner et al. researched the effectiveness of using consumer VR equipment to carry out exposure therapy for glossophobia, the fear of public speaking [32]. They found that the improvements from self-led exposure on the VR platform were of a similar magnitude to therapist-led exposure.
Freeman et al. [31] emphasised in their review that symptoms of mental disorders can be evaluated in VR, but so far there has been a lack of tests of dependability and efficacy. Emotion recognition of the patient has the potential to massively improve the reliability of symptom evaluation. A number of people do not find current virtual environments fully immersive [33]. Emotion recognition algorithms used within the VR environment can track the way that users are feeling throughout the experience and allow the experience to change to better suit the user's emotions and reactions. This in turn can help to fully immerse the user in the environment. On top of VRET, there are many other uses for emotion recognition with a VR headset, such as a more interactive gaming experience obtained by tailoring the affective game experience to the individual user, i.e. the gaming experience is personalised while the game is played [34]. Another use is for training the military in simulated combat environments and tracking soldiers' emotions throughout, to see how they respond to different situations [21].

B. Application of VR in Self-Attachment Technique
One other critical application of this work is in the use of VR for the Self-Attachment Technique (SAT). The SAT is a newly proposed, self-administrable psychotherapeutic intervention [35] that is based on Bowlby's attachment theory [36]. It suggests that the personal development of an individual depends highly on the childhood type of attachment with their primary caregivers. A secure attachment to a caregiver provides a necessary sense of security and a foundation that enables the child to take risks, grow and develop their personality. Previous studies [37], [38] identified four major styles of attachment: (1) secure attachment (children who can depend on their caregivers), (2) avoidant-insecure (children who tend to avoid parents or caregivers, probably as a result of abusive or neglectful care), (3) ambivalent-insecure (children who become very distressed when a parent leaves, as a result of poor parental availability) and (4) disorganised-insecure (children who display a confusing mix of behaviour, most likely linked to inconsistent behaviour by their caregiver). The SAT intervention aims to help the individual overcome any early insecure attachments [36] and thus be able to self-regulate their emotions in order to deal with challenging life situations. The SAT is a double-role game in which the individual plays both the role of the adult-self, representing their rational self, and the role of the childhood-self, representing their emotional self. During the SAT sessions, the individual, as the adult-self, uses photos of their childhood to imaginatively parent their childhood-self, thus forming an internal secure bond between the two.
Due to its self-administrable nature, the SAT can be enhanced by technological tools and automated procedures. It has been shown that the interactions between the childhood-self and the adult-self of the individual can take place in a virtual environment [39]. Polydorou and Edalat [39] present an immersive VR platform containing a photo-realistic child avatar that represents the childhood-self of the user. This replaces the imaginative processes required by the SAT protocol. Furthermore, a virtual agent guides the user through the sessions, thus making the procedure fully automated and available to the user from the comfort of their home. In addition to various functionalities that make the interactions with the environment user-friendly, the platform includes an audio-based emotion recognition algorithm. Emotion recognition capabilities are crucial for the personalisation of the SAT platform and for the effectiveness of the intervention. Their current emotion recognition algorithm is an end-to-end neural network that can predict six discrete emotions (happy, sad, fear, disgust, surprise and anger) from the speech input of the user (audio modality). We surmise that by incorporating the visual modality in this VR framework and exchanging the discrete emotion predictions for continuous emotion predictions, we can enhance the emotion recognition performance and thus the experience of the user.

C. Outline of our results
In this paper we extend the current implementation of full-face emotion recognition to solve the partial-face problem in the continuous emotional domain. This problem is caused during the intervention by the VR head-mounted display (HMD) covering part of the face, thus reducing the number of features available for emotion prediction. We predict a valence and arousal value for each image, where each facial image has a large simulated occlusion around the user's eye region representing a VR headset. Training an emotion recognition model from scratch requires extremely long training times and enormous amounts of data; we therefore utilise transfer learning to further train EmoFAN [40], a state-of-the-art network trained on the user's entire facial expression. The prospective outcome of this work is an algorithm that can be applied to a fully immersive virtual environment, where it can determine the emotion of the user at all times. This emotion recognition tool, based on the visual modality, can be used during both exposure therapy and the SAT intervention. It has the potential to help each individual receive a superior, personalised experience.

II. RELATED WORKS
In recent years there have been a multitude of groundbreaking papers on deep neural networks. This has revolutionised the machine learning world and networks' abilities to learn ever more complex and relevant features in data in a hierarchical manner [41]. It has had a huge impact on emotion recognition as well. Li et al. [41] concisely describe the general pipeline for automatic deep FER: (1) frame preprocessing, which can include face detection and alignment, data augmentation and face normalisation; (2) deep feature learning, using any of convolutional neural networks, deep belief networks, deep autoencoders, recurrent neural networks and generative adversarial networks; and (3) deep feature classification, where each image is classified into one of the basic emotion categories, regularly using softmax loss to minimise the cross-entropy between estimated class probabilities and the ground-truth distribution.
While there is a vast array of publications on discrete emotion estimation, in this work we focus on the continuous emotional dimensions, specifically valence and arousal, which encode small changes in the intensity of emotions [19], [20]. Gunes et al. [42] give a comprehensive overview of emotion representation in a continuous space in their survey paper. Several papers in the literature conclude that this continuous representation is a far better representation of emotions. Eerola et al. [43] explain that one of the largest differences between the continuous and the discrete models is that the discrete model performs far worse at characterising cases that are more emotionally ambiguous. A good example is the pair of emotions 'contempt' and 'anger', which are extremely similar visually. If the model predicted 'anger' when in fact the ground-truth label was 'contempt', then even though these are near-identical emotions on a valence-arousal scale, this would be backpropagated as an incorrect prediction in a discrete prediction model. With a valence-arousal model, however, as the two discrete emotions are close in the 2-D Euclidean space, the algorithm can understand a lot more about the similarities and differences between emotions and can become much more expressive in its learned features. Yu et al. [44] came to a similar conclusion.
One of the largest challenges for accurate FER outside of laboratory environments today is partial occlusion of the face [45]. Many different objects, such as sunglasses, hats and hair, can occlude parts of a face. Occlusions make it far more difficult to extract discriminative features from faces, mainly due to inexact face alignment, incorrect feature location and face registration error [46]. Zhang et al. [45] conducted a facial expression analysis (FEA) survey for faces under partial occlusion. This is, to their knowledge, the first of its type for partially occluded faces and it is essential reading for any researcher looking to work in the field of FER with occluded faces. Their survey found that the mouth is the most critical region of the face for FEA. Occlusions of the mouth have enormous effects on the classification of the six basic emotions. They conclude that the eyes are the second most significant region of the face. Occluded eyes have a great effect on the classification of surprise, disgust and sadness, a moderate effect on anger, but only a negligible effect on happiness. Occlusions of the eyes are of particular importance to this work because we are attempting to recognise facial expressions while a user wears a VR headset, which occludes the upper half of the face, including the eyes. Lian et al. [47] experimented on the ExpW dataset [10], focusing on how different facial features affect the overall result of an emotion prediction algorithm. They investigated the way different facial features affect the prediction of each emotion. Their work clearly emphasises that the emotions with the largest arousal values, i.e. surprise, fear and anger, all seem to rely more on the eyes for their predictions. Their work highlights that emotion recognition algorithms focus more on the eyes for arousal prediction and more on the mouth for valence prediction.
Hinton et al. [48] proposed a deep generative model that used deep belief networks to model pixel-level features. The features it learns are good for discriminating facial expressions and, by exploiting the generative ability of the model, it is possible to deal with occluded regions by filling them in. Xu et al. [49] proposed a FER model based on transfer learning from two trained deep convolutional networks, which were pretrained on different datasets, one with added occluded samples. Recently, Li et al. [50] designed a CNN with attention (ACNN) for FER in the presence of occlusions. The ACNN enables the model to shift attention from occluded patches to other unobstructed as well as discriminative facial regions.
Georgescu and Ionescu [51] focused on FER in a VR setting, which entails a severe occlusion as the user is wearing a HMD. They train their neural networks on modified training examples by intentionally occluding the upper half of the face. This approach forces the neural networks to focus on the lower half of the face for predictions. Houshmand et al. [52] followed on from the work above. They identified that a distinctive characteristic of the standardised occlusion arising from a VR headset is that it is simple to model mathematically. As commodity VR headsets have a recognised shape and size, it is possible to simulate an occlusion representing these headsets. This is a simpler and faster procedure than attempting to collect a brand new dataset of people wearing these headsets. Using this approach, Houshmand et al. simulated occlusions from VR headsets, while simultaneously using transfer learning to take advantage of pre-trained FER networks. A key feature of both of these approaches is that they focus only on discrete emotions. In this work we strive to expand into the continuous emotional domain. To our knowledge, our work is the first study to focus on FER under VR occlusions with a continuous emotional representation.

III. ETHICS AND APPLICATIONS
It is critical at this point to consider and discuss the potential repercussions of this work, in particular the social, legal, environmental and ethical issues. Firstly, implementing machine learning (ML) techniques can require a large amount of computing resources and long training times. These machines use a large amount of electricity, which has a negative environmental impact. To combat this, the code was designed and always tested on smaller data subsets before it was implemented on a large dataset, to prevent errors occurring and therefore wasting valuable resources.
Secondly, as with most ML algorithms, there is always a risk of bias when using specific datasets. Datasets can lack a varied representation of the underlying population. Specifically when it comes to emotion recognition, there can be further bias, as there can be cultural differences in the way we express emotions [53]. Overall, this means that picking a dataset which under-represents a certain demographic can lead to the algorithm not working effectively for people of that heritage. This occurs because the algorithm was never trained to understand how facial expressions relate to emotions in that culture. A famous example of bias in machine learning is ImageNet [54], a very large database with over 14 million predominantly European American images. It is the baseline dataset on which many AI algorithms are trained. In 2019, ImageNet Roulette [55] came out, which emphasised and exposed how systemic biases in the field of facial recognition have been passed on to machines by the scientists who trained their algorithms. To circumvent these issues for non-European American cultures, it is critical to retrain the models published here on local data to attenuate bias.
Lastly, it is imperative to consider how this technology may be misused by individuals, corporations or even governments. A technology which can read your emotions from your facial expressions alone has many potential negative applications. Examples range from remote-interviewee analysis and manipulating customers based on their reactions to advertising, to more sinister uses, such as bombarding users with propaganda to sway their political beliefs. Wherever this technology is in use, it is imperative that the correct frameworks are in place to prevent under-represented demographics from being discriminated against, consistent with the approach that biometric technologies are taking [56].

IV. METHODOLOGY

A. Processing Input: Simulation of Virtual Reality Occlusion
For face detection, face alignment and the detection of the 68 facial landmark coordinates, we used Tzimiropoulos and Bulat's Pytorch implementation [57]. The landmarks for an image from AffectNet [9] can be seen in Fig. 1. There is no public standard occluded-face image database containing people wearing VR headsets. Therefore, we created VR-occluded images by masking the upper region of the FER database AffectNet [9]. Following face detection, alignment and landmark detection, we applied a VR patch, which represents the headset. To implement a facial occlusion representing a VR headset, we followed a procedure very similar to Houshmand et al. [52].
The model was based around the Oculus Quest 2 [58], with dimensions of approximately 170 × 90 mm. We could not use a fixed-size covering, as different faces fill different proportions of their images. The separation of the two temporal bones in the facial landmarks was used as a reference to scale the VR patch. The midpoint of the line passing through the eye centres was used as the central coordinate of the VR headset. From this, we generated the polygonal occluding patch. Furthermore, we aligned the patch with the axis passing through the eye centres to obtain a more accurate representation of a VR headset across a variety of face rotations. To obtain the angle of incline, we computed the arctangent of the change in the y-coordinates over the change in the x-coordinates of the eye centre points. We then used a rotation matrix to rotate the corner points of the patch around its central pivot point. The resulting geometric model ensured that the implemented occlusion was aligned in translation, scale and rotation. Houshmand's approach [52] is simple but elegant.
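To make the geometric model concrete, below is a minimal NumPy sketch of the patch construction. The helper name, the landmark indices (the standard 68-point convention, with points 0 and 16 at the temples and points 36-47 around the eyes) and the exact scaling rule are our assumptions for illustration; the actual implementation follows Houshmand et al. [52].

```python
import numpy as np

def vr_occlusion_patch(landmarks, headset_mm=(170.0, 90.0)):
    """Hypothetical helper: corner points of the simulated VR patch.

    landmarks: (68, 2) array of facial landmark (x, y) coordinates.
    headset_mm: headset width and height (Oculus Quest 2, approx.).
    """
    # Scale: temple-to-temple separation (points 0 and 16) sets the width.
    width = np.linalg.norm(landmarks[16] - landmarks[0])
    height = width * headset_mm[1] / headset_mm[0]  # keep the headset aspect ratio

    # Centre: midpoint of the line joining the two eye centres.
    left_eye = landmarks[36:42].mean(axis=0)
    right_eye = landmarks[42:48].mean(axis=0)
    centre = (left_eye + right_eye) / 2.0

    # Rotation: arctangent of the eye-to-eye slope.
    dx, dy = right_eye - left_eye
    angle = np.arctan2(dy, dx)
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])

    # Axis-aligned corners around the origin, rotated then translated,
    # so the patch is translation-, scale- and rotation-aligned.
    corners = np.array([[-width, -height], [width, -height],
                        [width, height], [-width, height]]) / 2.0
    return corners @ rot.T + centre
```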
Finally, if a face is detected by the face detection algorithm, a bounding box is applied around this face. It is this image of the face inside the bounding box, with the VR patch applied, which is used to train and test the model. Fig. 2 exhibits the simulation of a VR occlusion and the application of a bounding box.

B. Transfer Learning
When a new task is learned, the parameters of a neural network are generally initialised to random values. However, this approach is not efficient when a very similar task has already been learnt. As humans, we have the ability to transfer knowledge from one domain to another that we have not experienced before. Transfer learning takes a model trained on one task and repurposes it for a second, related task.
With computer vision, if our task is to recognise emotions from faces, we can initialise our network with parameters from a second network originally designed to recognise faces. This approach can enormously accelerate training compared with randomly initialising parameters. The reason is that there are millions of parameters to train, and training these all from scratch is cost-prohibitive. Shin et al. [59] have shown that using transfer learning with these pre-trained model parameters is a great starting point for other computer vision tasks. This is because objects in images share low-level features, obtained by the lowest-level filters of the neural network. The upper layers, most commonly dense fully-connected layers, are usually exchanged with randomly initialised layers. Throughout training, the lower layers are generally "frozen" to prevent their parameters changing. This ensures that low-level features learned on the original dataset are maintained. It is solely the final layers that are then freely trained to take in the extracted low-level features and learn the actual classification/regression task.
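As an illustration, a minimal PyTorch sketch of this recipe is given below. The ResNet-18 backbone is purely illustrative and is not the network used in this work; only the freeze-and-replace pattern is the point.

```python
import torch
from torch import nn
from torchvision import models

# Illustrative transfer-learning setup: load a pretrained backbone,
# freeze its low-level feature extractors, and replace the head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():
    param.requires_grad = False            # "freeze" the pretrained layers

# New head: two outputs, one each for valence and arousal, randomly
# initialised and free to train.
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the unfrozen (head) parameters are passed to the optimiser.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```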
For predicting continuous valence and arousal values for facial expressions under VR occlusions, we used transfer learning with EmoFAN [40]. EmoFAN is a state-of-the-art emotion-predicting convolutional neural network. It uses a pioneering approach for continuous valence and arousal estimation from facial images recorded in-the-wild, surpassing previous state-of-the-art methods by an enormous margin on databases such as SEWA [60] and AFEW-VA [61]. EmoFAN builds on work on the face-alignment network (FAN) [57], which extracts facial landmark features. The network is split into several segments. The first segment contains convolutional blocks which extract shallow features at the same resolution as the input image. The second segment comprises two hourglasses that compute features related to the facial landmark estimation task. The last segment is made up of convolutional blocks and fully connected layers which extract features for the emotion prediction.
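As a structural orientation only, the three-segment layout described above might be sketched as follows. All layer types and sizes here are placeholders of ours and do not reproduce the real EmoFAN modules (in particular, a real hourglass downsamples and upsamples with skip connections).

```python
import torch
from torch import nn

class ThreeSegmentSketch(nn.Module):
    """Schematic of the EmoFAN-style pipeline; not the actual architecture."""
    def __init__(self):
        super().__init__()
        # Segment 1: convolutional blocks producing shallow features at
        # the input resolution.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # Segment 2: stand-in for the hourglass modules that produce
        # landmark-related feature maps (one channel per landmark).
        self.landmark_features = nn.Sequential(
            nn.Conv2d(64, 68, 3, padding=1), nn.ReLU(),
            nn.Conv2d(68, 68, 3, padding=1))
        # Segment 3: convolutional/fully connected head for the final
        # emotion prediction (valence, arousal).
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(68, 2))

    def forward(self, x):
        feats = self.shallow(x)
        heatmaps = self.landmark_features(feats)  # landmark attention maps
        return self.head(heatmaps)

# e.g. ThreeSegmentSketch()(torch.randn(1, 3, 256, 256)) -> shape (1, 2)
```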

V. EXPERIMENTAL RESULTS
The aim of our experiments was to teach the network to ignore, during emotion prediction, the eye region of the face that would be covered by a VR headset occlusion.
A. Datasets

1) AffectNet: EmoFAN [40] was trained on the entirety of the AffectNet dataset [9]. AffectNet is a visual-only-modality facial expression dataset of spontaneous affect in-the-wild, which contains over one million images sourced from the Internet. These were obtained by searching 1,250 emotion-related keywords in six different languages on search engines such as Google, Bing and Yahoo. It is by far the largest database which provides facial expressions in two different emotion models (categorical and dimensional). To this end, 450,000 images have manually annotated labels, generated by twelve full-time and part-time annotators at the University of Denver, for eight basic emotional classes as well as other categories related to the intensity of valence and arousal. This dataset is varied in its subject demographic. A subset of AffectNet [9], of 291,650 images with valence and arousal annotations, was used for further training the visual modality of the emotion recognition model in this work. This comprised a train set of 287,651 images and a test set of 3,999 images.
2) AFEW-VA: AFEW-VA is a small database created by Kossaifi et al. [61]. It contains 600 video clips, encompassing 30,000 annotated frames, extracted from feature films such as Harry Potter, 21 and Ocean's Eleven. They capture real-world conditions such as hybrid occlusions, a variety of illuminations and a vast array of movements by the subjects of the videos. The clips range from 10 frames to under 200 frames in length. The database also includes per-frame annotations of valence and arousal, with discrete values in the range [-10, +10].

B. Performance measures
The performance metrics we used to evaluate how well our model performed in this work are: Root Mean Squared Error (RMSE), which must be minimised, and the Pearson Correlation Coefficient (PCC) and Lin's Concordance Correlation Coefficient (CCC) [62], which must both be maximised. The notation used in the definitions is as follows: $Y$ is the predicted label, $\hat{Y}$ is the ground-truth label, and $\mu_Y$ and $\sigma_Y$ are the mean and standard deviation of $Y$.
RMSE assesses how near the predicted values are to the target values:
\[ \mathrm{RMSE}(Y, \hat{Y}) = \sqrt{\mathbb{E}\big[(Y - \hat{Y})^2\big]}, \]
with the sample estimate:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}. \]
PCC measures how correlated the predictions are with the target values:
\[ \mathrm{PCC}(Y, \hat{Y}) = \frac{\operatorname{Cov}(Y, \hat{Y})}{\sigma_Y\,\sigma_{\hat{Y}}}, \]
with the sample estimate:
\[ \mathrm{PCC} = \frac{\sum_{i=1}^{n}(y_i - \mu_Y)(\hat{y}_i - \mu_{\hat{Y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \mu_Y)^2}\,\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \mu_{\hat{Y}})^2}}. \]
CCC is a metric regularly used in dimensional emotion recognition. It measures the agreement between two variables; in this case, the agreement between the true emotion dimension and the predicted emotion dimension:
\[ \mathrm{CCC}(Y, \hat{Y}) = \frac{2\,\mathrm{PCC}\,\sigma_Y\,\sigma_{\hat{Y}}}{\sigma_Y^2 + \sigma_{\hat{Y}}^2 + (\mu_Y - \mu_{\hat{Y}})^2}. \]
It is an amalgamation of Pearson's correlation and the bias between the two sequences. If the predictions deviate in value, the CCC score is reduced in proportion to the shift [62]. In essence, it penalises correlated signals with different means, which PCC does not. This makes CCC more reliable than Pearson correlation and RMSE for evaluating the performance of multi-dimensional emotion recognition.
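As a quick numerical illustration of these definitions (and of why CCC is stricter than PCC), the following NumPy sketch computes all three metrics; the function names are ours.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between predictions y and targets y_hat."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def pcc(y, y_hat):
    """Pearson correlation coefficient."""
    return np.cov(y, y_hat, bias=True)[0, 1] / (y.std() * y_hat.std())

def ccc(y, y_hat):
    """Lin's concordance correlation coefficient: unlike PCC, it
    penalises correlated signals whose means differ."""
    cov = np.cov(y, y_hat, bias=True)[0, 1]
    return 2 * cov / (y.var() + y_hat.var() + (y.mean() - y_hat.mean()) ** 2)

# Usage: a constant offset leaves PCC at 1.0 but lowers CCC.
y = np.array([0.1, 0.4, -0.2, 0.7])
print(pcc(y, y + 0.5), ccc(y, y + 0.5))  # -> 1.0 and roughly 0.47
```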

C. Training Implementation
We chose to use Pytorch [63], a Python machine learning framework, as EmoFAN was built with it. All our training occurred on either an NVIDIA TITAN Xp or an NVIDIA Tesla T4 GPU. Even using these powerful GPUs, the total training time per model was around 14 hours. We used the Adaptive Moment Estimation (ADAM) optimiser [64], which uses adaptive learning rates for all the different parameters in the model. Hyper-parameters were validated using a randomised grid search. We tested the batch size in the range 8 to 256 and the learning rate in the range $1 \times 10^{-4}$ to $1 \times 10^{-2}$. We found that a batch size of 32 and a learning rate of $1 \times 10^{-3}$ were the best for our training purpose. We set shuffle to True in our Pytorch Dataloader function when training, to ensure that the algorithm did not simply learn patterns based on the order of the samples it was trained on. The input to our neural network was set to 256 × 256 pixels; therefore all images were scaled to this size. All images were randomly horizontally flipped with a probability of 0.5. Our code is publicly available [65].
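For concreteness, a minimal sketch of this setup is shown below; the dataset and model here are stand-ins so the snippet is self-contained, and the real pipeline is in our released code [65].

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision import transforms

# Hyper-parameters selected by the randomised grid search described above.
BATCH_SIZE, LEARNING_RATE = 32, 1e-3

# Preprocessing: scale to 256x256 and flip horizontally with p=0.5.
# In a real Dataset this would run inside __getitem__ on PIL images.
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])

# Stand-in data: 64 fake face crops with valence/arousal targets in [-1, 1].
images = torch.randn(64, 3, 256, 256)
targets = torch.rand(64, 2) * 2 - 1
loader = DataLoader(TensorDataset(images, targets),
                    batch_size=BATCH_SIZE, shuffle=True)  # shuffle=True, as in the text

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 256 * 256, 2))  # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
```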

D. Loss Functions
The loss functions we have chosen to predict affect dimensions are the Valence-Arousal losses based on CCC, PCC and RMSE:
\[ \mathcal{L}_{\mathrm{CCC}} = 1 - \mathrm{CCC}, \qquad \mathcal{L}_{\mathrm{PCC}} = 1 - \mathrm{PCC}, \qquad \mathcal{L}_{\mathrm{RMSE}} = \mathrm{RMSE}, \]
each computed for the valence and the arousal dimensions. Each loss function encodes important information about the task at hand. For our task we are focused on maximising the correlation coefficients between the prediction and the ground truth, which is done through PCC and CCC. In addition, RMSE leads to a lower prediction error and hence a higher accuracy. We therefore need a balance of these loss functions.
The Valence-Arousal losses can also be regularised with shake-shake coefficients $\alpha$, $\beta$ and $\gamma$:
\[ \mathcal{L}_{VA} = \alpha\,\mathcal{L}_{\mathrm{CCC}} + \beta\,\mathcal{L}_{\mathrm{PCC}} + \gamma\,\mathcal{L}_{\mathrm{RMSE}}, \]
where each coefficient is drawn uniformly at random from $(0, 1)$ each time the loss function is called. Overall, this prevents the network from focusing all its attention on minimising only one of the three losses.
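A compact PyTorch sketch of this combined loss, as we read it from the description above, is given below; the helper names and the per-dimension summation are our assumptions.

```python
import torch

def ccc(p, t, eps=1e-8):
    """Concordance correlation coefficient between two 1-D tensors."""
    mp, mt = p.mean(), t.mean()
    cov = ((p - mp) * (t - mt)).mean()
    return 2 * cov / (p.var(unbiased=False) + t.var(unbiased=False)
                      + (mp - mt) ** 2 + eps)

def pcc(p, t, eps=1e-8):
    """Pearson correlation coefficient between two 1-D tensors."""
    cov = ((p - p.mean()) * (t - t.mean())).mean()
    return cov / (p.std(unbiased=False) * t.std(unbiased=False) + eps)

def va_loss(pred, target):
    """Shake-shake-regularised valence-arousal loss.

    pred, target: (N, 2) tensors holding [valence, arousal] per sample.
    alpha, beta, gamma are redrawn uniformly from (0, 1) on every call.
    """
    alpha, beta, gamma = torch.rand(3)
    loss = pred.new_zeros(())
    for d in (0, 1):                      # valence, then arousal
        p, t = pred[:, d], target[:, d]
        rmse = torch.sqrt(((p - t) ** 2).mean())
        loss = loss + (alpha * (1 - ccc(p, t))
                       + beta * (1 - pcc(p, t))
                       + gamma * rmse)
    return loss
```

In training, va_loss(pred, target) would stand in for a plain MSE criterion; redrawing the coefficients on every call is what stops the optimiser from settling on a single term.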

E. Experimental descriptions and Results
Our aim was to teach the network to ignore the eye region of the face that would be covered by a VR headset occlusion. We propose the EmoFAN-VR algorithm for emotion detection, trained to solve the partial-face problem.
1) AffectNet: Using the Valence-Arousal losses, we further trained the convolutional blocks and fully connected layers of the EmoFAN network which extract features for the estimation of affect. We trained using images from AffectNet with VR occlusions applied to every image. To solve the problems of overfitting that we faced in our early experiments, we implemented several regularisation techniques. Firstly, we included shake-shake regularisation on the Valence-Arousal losses. Moreover, we added dropout in the fully connected layers, with the probability of an element being zeroed set to 0.5. The EmoFAN-VR algorithm was based on this training.
In addition to the Valence-Arousal losses, in a separate experiment, we attempted to also train with a cross-entropy loss, representing a discrete label loss over the eight discrete emotions (Ekman's original six plus neutral and contempt):
\[ \mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{8} \hat{y}_i \log(y_i). \]
However, this led to no great improvement in results. Furthermore, the EmoFAN network uses an attention mechanism that drives the focus of the network to the regions of the face that are most relevant for emotion estimation. The attention mechanism is implemented as a multiplication of the 68-point facial landmarks with the features extracted at different levels in the FAN. We experimented, as can be seen in Fig. 3, with ignoring the landmarks associated with the right eyebrow, left eyebrow, right eye and left eye, as these are all covered by the VR headset. The aim was that the algorithm would use tailored attention maps, ignoring certain facial landmarks and therefore focusing more attention on landmarks in the lower part of the face for emotion estimation. Once again, however, this led to only minor improvements over our EmoFAN-VR algorithm.
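A minimal sketch of this landmark-masking idea is given below. The indices follow the standard 68-point convention, and how the masked maps are fed back into the FAN attention mechanism is our assumption.

```python
import torch

# Eyebrows (points 17-26) and eyes (points 36-47) in the 68-point landmark
# convention are hidden by the VR patch, so their maps are zeroed out.
OCCLUDED_LANDMARKS = list(range(17, 27)) + list(range(36, 48))

def mask_occluded_heatmaps(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (N, 68, H, W) landmark heatmaps used as attention maps.

    Returns a copy with the occluded landmarks' maps set to zero, so the
    attention multiplication ignores the eye region entirely.
    """
    masked = heatmaps.clone()
    masked[:, OCCLUDED_LANDMARKS] = 0.0
    return masked
```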
As the experiments of adding an extra loss function and of ignoring certain landmarks both led to results similar to those of our original EmoFAN-VR algorithm, we decided to stick with our initial, most simply trained model and follow Occam's razor [66]. This principle propounds that we should choose simpler models over more complex ones. In essence, it is a heuristic implying that more complex hypotheses make more assumptions, and as a result will be too narrow in scope and will not generalise well to new data. Results of the original EmoFAN and of our EmoFAN-VR algorithm on the AffectNet dataset [9] with VR occlusions applied are given in Table I. EmoFAN-VR shows a vast improvement over the original EmoFAN algorithm, with over a 21% increase in arousal CCC, a 6.4% increase in arousal PCC and a 3.7% increase in valence CCC.

Fig. 3. Example of an image with all the landmarks (left) and with the landmarks around the eye regions ignored (right).
2) AFEW-VA: We evaluated the EmoFAN-VR algorithm on the entire AFEW-VA dataset [61], with VR occlusions applied to every image in the dataset. As far as we are aware, after extensively searching the literature and contacting the original authors of the AFEW-VA paper [61], this is the first time the AFEW-VA dataset has been used to test an algorithm with systematic occlusions covering the eye region, representing virtual reality headsets.
Fig. 4 shows the two-dimensional joint distribution of arousal and valence values, comparing the ground-truth distribution with the distribution of predictions from the EmoFAN-VR algorithm. It is clear that the predictions distribution is close to the ground-truth distribution. Nevertheless, EmoFAN-VR does slightly struggle with more negative arousal values, and with extreme positive arousal combined with negative valence. AffectNet [9], the dataset this network is trained on, has fewer examples in both these regions, in particular for arousal values less than -0.3, which may explain why the network is poorer at predicting values in these regions. In addition, Zhang et al. [45] showed that occluded eyes have a large effect on the classification of sadness and disgust, and a moderate effect on anger. Sadness is the only discrete emotion present at extreme negative arousal, while anger and disgust combine extreme positive arousal with negative valence. As these emotions rely considerably on the eyes, it is logical that the algorithm struggles to accurately predict the correct valence and arousal values in these regions.
In Table II, we can see the results of both the original EmoFAN algorithm and the EmoFAN-VR algorithm on the whole AFEW-VA dataset with VR occlusions applied. The EmoFAN-VR algorithm outperforms the EmoFAN algorithm, on the AFEW-VA dataset with VR occlusions, by a very large margin on all metrics. Valence CCC improves by over 44%, valence PCC by over 46% and valence RMSE by over 13%. Arousal CCC improves by over 46%, arousal PCC by close to 23% and arousal RMSE by over 2.5%. This result sets a new baseline for the AFEW-VA dataset with VR occlusions applied.
What makes this result even more remarkable is that the EmoFAN-VR algorithm was not fine-tuned on the AFEW-VA dataset at all. This shows that the EmoFAN-VR algorithm generalises well to new unseen data.

F. Research Limitations and Future Work
One of the main limitations of the EmoFAN-VR algorithm, fine-tuned on occluded data, is that it was only trained on a subset of AffectNet [9]. This is because only a subset was available at the time. In future work, it would be of enormous benefit to train on the entire AffectNet dataset rather than just a subset. In addition, we could conduct a more in-depth hyper-parameter search, experimenting with more learning rates, loss functions and other regularisation techniques.
Another interesting improvement, implemented by Handrich et al. [67], is to train on several different datasets in order to cover the valence-arousal space more broadly. Handrich et al. showed that results on the AFEW-VA dataset [61] benefited from fusing training samples from a variety of datasets. Furthermore, if the datasets are well selected to cover a variety of different ethnicities, this can attenuate cultural bias.
Fully understanding human emotions in realistic situations relies on encoding information from different modalities, of which facial expression is only one. A further improvement would be to fuse several modalities with the visual modality, such as audio, text and heart rate.

VI. CONCLUSION
The principal aim of this work was to create an algorithm for emotion detection which can be made available during a virtual reality experience, such as exposure therapy or the self-attachment technique. We aimed to do this by extending the current implementation of full-face emotion recognition to solve the partial-face problem. To achieve this aim, we created an adapted and enhanced version of EmoFAN, the EmoFAN-VR algorithm, specialised in detecting emotions from faces occluded by a VR headset. We experimented with several loss functions, learning rates and batch sizes, and we used several different regularisation techniques. These investigations resulted in the EmoFAN-VR algorithm, which improved on VR-occluded AffectNet [9] by over 21% relative to EmoFAN on the arousal CCC metric.
By using in-the-wild data, rather than data accumulated under laboratory conditions, to train EmoFAN-VR, we expected it to generalise better to new unseen data. This expectation was confirmed when we tested it on the AFEW-VA dataset [61] with VR occlusions applied. There was a significant improvement over the original EmoFAN algorithm across the board, with over a 45% improvement on certain metrics. This result sets a new baseline for the AFEW-VA dataset with VR occlusions applied. The outcome is significant, as the model's purpose is to detect the emotional state of the user while wearing a VR headset. Overall, the addition of this work within the VR realm is likely to have large benefits for users looking for a more personalised experience.

TABLE I: RESULTS ON THE AFFECTNET DATASET WITH VR OCCLUSIONS

TABLE II: RESULTS ON THE AFEW-VA DATASET WITH VR OCCLUSIONS