DeepConvCVAE
VAE¶
A variational autoencoder (VAE)1 is a generative model in machine learning designed to encode input data into a latent space and subsequently decode it to reconstruct the original input. VAEs excel at generating new data samples that resemble a given dataset and are also valuable for various unsupervised learning tasks.
We utilised a variant of the variational autoencoder, the conditional VAE (CVAE)2, to simulate transcriptomics data conditioned on cell types. The CVAE allows simulators to be trained on high-dimensional yet sparse single-cell gene expression data significantly faster. Additionally, it integrates various types of neural networks into the parameterised statistical inference process, offering greater flexibility than purely statistical inference methods for denoising multimodally distributed data.
Background: scRNA-seq count matrix simulation¶
The recent surge of interest in scRNA-seq stems from its ability to elucidate the correlation between genetics and diseases at the level of individual cell types with greater granularity. To discern cell-type-specific gene expression landscapes, numerous scRNA-seq analysis tools have emerged, and their reliability in practical applications hinges on evaluation against realistically labelled data. Consequently, many simulators have been developed to provide ground-truth data. Most of these simulators are based on statistical inference techniques, using well-estimated parameters of probability distributions to characterise given scRNA-seq data. However, statistical inference methods have two limitations: long training times and a lack of denoising strategies. First, methods in this category typically lack support from computing acceleration technologies such as graphics processing units (GPUs), particularly when compared to deep learning libraries. This matters because the practical utility of simulated data varies across cell, tissue, and disease types, potentially necessitating model retraining for each application condition. Second, the inherent variability of complex biological systems often eludes capture by rigid statistical inference methods that rely solely on fixed probability distribution assumptions, which may constrain their ability to denoise data exhibiting multimodal distributions.
Point
A count matrix can be simulated on demand.
Programming CVAE¶
We implemented a CVAE framework based on convolutional neural networks.
The framework couples a convolutional encoder and decoder, both conditioned on one-hot cell-type labels.
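A minimal Keras sketch of such a framework is given below. The encoder layer sizes follow the model summary printed in the Training section of this page; the decoder internals, kernel sizes, strides, and activations are our assumptions rather than the package's exact implementation.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, n_classes, side = 2, 11, 144

class Sampling(layers.Layer):
    """Reparameterization trick: z = z_mean + exp(z_log_var / 2) * eps."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: (cell "image", one-hot cell type) -> (z_mean, z_log_var, z)
x_in = keras.Input(shape=(side, side, 1), name="encoder_input")
y_in = keras.Input(shape=(n_classes,), name="class_labels")
# Project the label onto an image-sized map and stack it as a second channel
y_map = layers.Reshape((side, side, 1))(layers.Dense(side * side)(y_in))
h = layers.Concatenate()([x_in, y_map])  # (144, 144, 2)
h = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(h)
h = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(h)
h = layers.Dense(16, activation="relu")(layers.Flatten()(h))
z_mean = layers.Dense(latent_dim, name="z_mean")(h)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(h)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model([x_in, y_in], [z_mean, z_log_var, z], name="encoder")

# Decoder: (z, one-hot cell type) -> reconstructed cell "image"
z_in = keras.Input(shape=(latent_dim,))
yd_in = keras.Input(shape=(n_classes,))
d = layers.Concatenate()([z_in, yd_in])
d = layers.Dense(36 * 36 * 32, activation="relu")(d)
d = layers.Reshape((36, 36, 32))(d)
d = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(d)
x_out = layers.Conv2DTranspose(1, 3, strides=2, padding="same",
                               activation="sigmoid")(d)
decoder = keras.Model([z_in, yd_in], x_out, name="decoder")
```

Conditioning both networks on the label is what lets the trained decoder later generate cells of a requested type.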
Simulation¶
We have trained a model that we can use to simulate 100 cells (num_dots=100) per cell cluster. The final simulated count matrix will be saved in h5 format.
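The idea can be sketched as follows: sample the latent prior once per cell type, decode conditioned on the one-hot label, and write the stacked matrix to HDF5. The function simulate_counts and the dataset names X/y below are illustrative, not the package API; decoder stands in for the trained Keras decoder.

```python
import numpy as np

def simulate_counts(decoder, num_dots=100, n_classes=11,
                    latent_dim=2, out=None):
    """Hypothetical sketch: decode num_dots latent samples per cell type."""
    blocks, labels = [], []
    for c in range(n_classes):
        z = np.random.normal(size=(num_dots, latent_dim)).astype("float32")
        y = np.zeros((num_dots, n_classes), dtype="float32")
        y[:, c] = 1.0                      # condition on cell type c
        x = np.asarray(decoder([z, y]))    # decoder.predict([z, y]) in Keras
        blocks.append(x.reshape(num_dots, -1))   # flatten image -> gene vector
        labels.append(np.full(num_dots, c))
    mat, lab = np.vstack(blocks), np.concatenate(labels)
    if out is not None:                    # save in h5 format
        import h5py
        with h5py.File(out, "w") as f:
            f.create_dataset("X", data=mat)
            f.create_dataset("y", data=lab)
    return mat, lab
```

With the defaults this yields 100 × 11 = 1100 simulated cells, labelled by the cluster they were conditioned on.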
console
KerasTensor(type_spec=TensorSpec(shape=(None, 144, 144, 1), dtype=tf.float32, name='encoder_input'), name='encoder_input', description="created by layer 'encoder_input'")
KerasTensor(type_spec=TensorSpec(shape=(None, 11), dtype=tf.float32, name='class_labels'), name='class_labels', description="created by layer 'class_labels'")
2024-07-30 22:41:02.531703: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
===>label 0
1/1 [==============================] - 0s 292ms/step
1/1 [==============================] - 0s 19ms/step
1/1 [==============================] - 0s 20ms/step
1/1 [==============================] - 0s 18ms/step
1/1 [==============================] - 0s 19ms/step
1/1 [==============================] - 0s 19ms/step
1/1 [==============================] - 0s 19ms/step
1/1 [==============================] - 0s 19ms/step
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 19ms/step
1/1 [==============================] - 0s 23ms/step
...
Training¶
We use the 68k PBMC scRNA-seq dataset downloaded from https://github.com/10XGenomics/single-cell-3prime-paper/tree/master/pbmc68k_analysis. It contains:
- fresh_68k_pbmc_donor_a_filtered_gene_bc_matrices.tar
- 68k_pbmc_barcodes_annotation.tsv
We use an in-house data processing method to filter out cells and genes unsuitable for analysis. It can be accessed programmatically as uc.mat.data_in, which reads the data and returns the portions used for training and testing.
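The loader itself is in-house, but judging from the shapes printed below, each cell's gene vector is arranged into a 144×144 single-channel image and the cells are split roughly 60/40 into training and test sets. A hypothetical NumPy version of that preprocessing (the function names are illustrative, not the uc.mat.data_in API) might look like:

```python
import numpy as np

def to_images(X, side=144):
    """Zero-pad each cell's gene vector to side*side values and reshape
    into a single-channel image for the convolutional encoder."""
    n, g = X.shape
    pad = side * side - g
    if pad < 0:
        raise ValueError("more genes than side*side pixels")
    return np.pad(X, ((0, 0), (0, pad))).reshape(n, side, side, 1)

def split(X, y, test_frac=0.4, seed=1):
    """Random train/test split, roughly the 41147/27432 split shown below."""
    idx = np.random.default_rng(seed).permutation(len(X))
    cut = len(X) - int(round(len(X) * test_frac))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]
```

Casting expression vectors as images is what makes convolutional layers applicable to the count matrix at all.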
console
Train: (41147,) Test: (27432,)
Train: (41147,) Test: (27432,)
Train: (41147,) Test: (27432,)
Train: (41147,) Test: (27432,)
Train: (41147,) Test: (27432,)
['CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD4+/CD45RO+ Memory', 'CD19+ B', 'CD4+/CD25 T Reg', 'CD56+ NK', 'CD4+ T Helper2', 'CD4+/CD45RA+/CD25- Naive T', 'CD34+', 'Dendritic', 'CD14+ Monocyte']
(41147, 144, 144, 1)
(41147,)
(27432, 144, 144, 1)
(27432,)
We then train the ConvCVAE framework. Any other training data can be used, provided it has been tailored to the expected input format (i.e. numpy.ndarray).
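The chain of TFOpLambda ops in the model summary below assembles the standard CVAE objective: a reconstruction mean-squared error plus the KL divergence between the approximate posterior N(z_mean, exp(z_log_var)) and a standard normal prior. A NumPy sketch of that objective (the image-size scaling constant is an assumption):

```python
import numpy as np

def cvae_loss(x, x_hat, z_mean, z_log_var, scale=144 * 144):
    """Composite CVAE objective: scaled reconstruction MSE + mean KL term."""
    # Reconstruction term: MSE over all pixels, scaled by the image size
    rec = np.mean((x.reshape(-1) - x_hat.reshape(-1)) ** 2) * scale
    # KL(q(z|x, y) || N(0, I)) per sample, summed over latent dimensions
    kl = -0.5 * np.sum(1 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=-1)
    return rec + np.mean(kl)
```

With a perfect reconstruction and a posterior equal to the prior the loss is zero; training trades reconstruction fidelity against keeping the posterior close to the prior.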
console
KerasTensor(type_spec=TensorSpec(shape=(None, 144, 144, 1), dtype=tf.float32, name='encoder_input'), name='encoder_input', description="created by layer 'encoder_input'")
KerasTensor(type_spec=TensorSpec(shape=(None, 11), dtype=tf.float32, name='class_labels'), name='class_labels', description="created by layer 'class_labels'")
2024-07-30 23:13:52.493677: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "ccvae"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
encoder_input (InputLayer) [(None, 144, 144, 1)] 0 []
class_labels (InputLayer) [(None, 11)] 0 []
encoder (Functional) [(None, 2), 917412 ['encoder_input[0][0]',
(None, 2), 'class_labels[0][0]']
(None, 2)]
decoder (Functional) (None, 144, 144, 1) 586465 ['encoder[0][2]',
'class_labels[0][0]']
dense (Dense) (None, 20736) 248832 ['class_labels[0][0]']
reshape (Reshape) (None, 144, 144, 1) 0 ['dense[0][0]']
concatenate (Concatenate) (None, 144, 144, 2) 0 ['encoder_input[0][0]',
'reshape[0][0]']
conv2d (Conv2D) (None, 72, 72, 16) 304 ['concatenate[0][0]']
conv2d_1 (Conv2D) (None, 36, 36, 32) 4640 ['conv2d[0][0]']
flatten (Flatten) (None, 41472) 0 ['conv2d_1[0][0]']
dense_1 (Dense) (None, 16) 663568 ['flatten[0][0]']
z_log_var (Dense) (None, 2) 34 ['dense_1[0][0]']
z_mean (Dense) (None, 2) 34 ['dense_1[0][0]']
tf.reshape_1 (TFOpLambda) (None,) 0 ['decoder[0][0]']
tf.reshape (TFOpLambda) (None,) 0 ['encoder_input[0][0]']
tf.__operators__.add (TFOp (None, 2) 0 ['z_log_var[0][0]']
Lambda)
tf.math.square (TFOpLambda (None, 2) 0 ['z_mean[0][0]']
)
tf.convert_to_tensor (TFOp (None,) 0 ['tf.reshape_1[0][0]']
Lambda)
tf.cast (TFOpLambda) (None,) 0 ['tf.reshape[0][0]']
tf.math.subtract (TFOpLamb (None, 2) 0 ['tf.__operators__.add[0][0]',
da) 'tf.math.square[0][0]']
tf.math.exp (TFOpLambda) (None, 2) 0 ['z_log_var[0][0]']
tf.math.squared_difference (None,) 0 ['tf.convert_to_tensor[0][0]',
(TFOpLambda) 'tf.cast[0][0]']
tf.math.subtract_1 (TFOpLa (None, 2) 0 ['tf.math.subtract[0][0]',
mbda) 'tf.math.exp[0][0]']
tf.math.reduce_mean (TFOpL () 0 ['tf.math.squared_difference[0
ambda) ][0]']
tf.math.reduce_sum (TFOpLa (None,) 0 ['tf.math.subtract_1[0][0]']
mbda)
tf.math.multiply (TFOpLamb () 0 ['tf.math.reduce_mean[0][0]']
da)
tf.math.multiply_1 (TFOpLa (None,) 0 ['tf.math.reduce_sum[0][0]']
mbda)
tf.__operators__.add_1 (TF (None,) 0 ['tf.math.multiply[0][0]',
OpLambda) 'tf.math.multiply_1[0][0]']
tf.math.reduce_mean_1 (TFO () 0 ['tf.__operators__.add_1[0][0]
pLambda) ']
add_loss (AddLoss) () 0 ['tf.math.reduce_mean_1[0][0]'
]
==================================================================================================
Total params: 1503877 (5.74 MB)
Trainable params: 1503877 (5.74 MB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________
18/1286 [..............................] - ETA: 2:58 - loss: 3376.0042
-
Cemgil, T., Ghaisas, S., Dvijotham, K., Gowal, S., & Kohli, P. (2020). The autoencoding variational autoencoder. Advances in Neural Information Processing Systems, 33, 15077-15087. ↩
-
Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., & Lloret, J. (2017). Conditional variational autoencoder for prediction and feature recovery applied to intrusion detection in IoT. Sensors, 17(9), 1967. ↩