Abstract:
The objective of acoustic scene classification (ASC) is to classify environments based on the sound events they produce. ASC has been used in a variety of applications, including audio surveillance, context-aware services and assistive technologies such as hearing aids. ASC is a challenging task because similar sound events occur across different acoustic scenes, causing high inter-class similarity. In this thesis, we approach this problem by providing a mechanism that derives discriminative features by suppressing certain sound events.
An acoustic scene can be viewed as a combination of background and foreground sound events. Often, either the background or the foreground carries the information needed to identify an acoustic scene uniquely. We propose to handle the sound events shared across scenes by combining robust principal component analysis (RPCA), subspace projection techniques and a self-attention network. These methods help in separating the background and foreground sound events, and in partially removing the background (or foreground) sound events.
We employ the framework of RPCA to decompose a given acoustic scene into background and foreground sound events. RPCA decomposes a given data matrix into a low-rank matrix and a sparse matrix. In the context of data describing an acoustic scene, the low-rank matrix represents the slowly changing background, and the sparse matrix represents the occasional foreground sound events. Further, we utilize a subspace projection technique named nuisance attribute projection (NAP) to reduce the inter-class similarity. NAP helps in partially removing the background (or the foreground) sound events by treating them as nuisance variations. The nuisance bases for applying NAP are learned from the background and foreground components separated by RPCA. The resulting background-suppressed and foreground-suppressed representations are combined using fusion techniques to improve classification accuracy. We also present an approach to incorporate label information in the subspace projections by learning class-specific nuisance bases. Projection using these bases, in combination with an attention mechanism, is used for effective suppression, leading to better discrimination. Our results on standard datasets indicate that the proposed methods based on RPCA and subspace projections help improve the classification accuracy.
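The two building blocks named above can be illustrated with a minimal sketch. It is not the implementation used in this thesis. It assumes that the scene is given as a real-valued matrix M (for example, frequency bins by frames), that RPCA is solved by principal component pursuit with a fixed-penalty augmented Lagrangian iteration, and that the nuisance basis is taken as the top-k left singular vectors of the component to be suppressed. The function names, the default parameter values and the choice of k are illustrative assumptions.

```python
import numpy as np

def shrink(X, tau):
    # Elementwise soft-thresholding: proximal operator of the l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(M, max_iter=500, tol=1e-7):
    # Split M into a low-rank part L (background) and a sparse part S
    # (foreground) by minimising ||L||_* + lam * ||S||_1 subject to M = L + S.
    # lam and mu follow common default choices (an assumption of this sketch).
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = m * n / (4.0 * np.abs(M).sum())
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint M = L + S
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S

def nap_projection(N, k):
    # Build the projection P = I - U_k U_k^T, where U_k holds the top-k left
    # singular vectors of the nuisance data N (one feature vector per column).
    U, _, _ = np.linalg.svd(N, full_matrices=False)
    Uk = U[:, :k]
    return np.eye(N.shape[0]) - Uk @ Uk.T
```

Under these assumptions, a background-suppressed representation of a feature matrix X with the same number of rows as M would be obtained as `nap_projection(L, k) @ X`, and a foreground-suppressed one as `nap_projection(S, k) @ X`.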