Learnable Query Contrast and Spatio-temporal Prediction on Point Cloud Video Pre-training

Authors

Xiaoxiao Sheng, Zhiqiang Shen, Longguang Wang, Gang Xiao

Keywords:

3D deep learning, point clouds, self-supervised pre-training, contrastive learning

Abstract

Point cloud videos capture the time-varying environment and are widely used for dynamic scene understanding. Existing methods develop effective networks for point cloud videos but do not fully exploit the prior information that pre-training can uncover. Furthermore, relying on a single supervised task with a large amount of manually labeled data may be insufficient to capture the foundational structures in point cloud videos. In this paper, we propose a pre-training framework, Query-CP, to learn representations of point cloud videos through multiple self-supervised pretext tasks. First, token-level contrast is developed to predict future features under the guidance of historical information. Using a position-guided autoregressor with learnable queries, the predictions are directly contrasted with the corresponding targets in the high-level feature space to capture fine-grained semantics. Second, contrastive learning alone fails to fully explore the complementary structure and dynamics information. To alleviate this, we design a decoupled spatio-temporal prediction task, in which a spatial branch predicts low-level features and a temporal branch explicitly predicts the timestamps of the target sequence. By combining these self-supervised tasks, multi-level information is captured during pre-training. Finally, the encoder is fine-tuned and evaluated on action recognition and dynamic semantic segmentation across three datasets. The results demonstrate the effectiveness of Query-CP: compared with state-of-the-art methods, fine-tuning accuracy on action recognition improves by 3.23% for 24-frame point cloud videos, and mean accuracy increases by 4.21%.
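
To make the two pretext tasks concrete, below is a minimal PyTorch sketch of the token-level contrast with learnable queries, plus a simplified temporal branch. All module names, tensor shapes, dimensions, and the choice of a transformer decoder as the autoregressor are our assumptions for illustration; the paper's actual implementation (including the spatial branch that predicts low-level features) is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAutoregressor(nn.Module):
    """Position-guided autoregressor (assumed design): learnable queries
    attend to historical token features and predict future token features."""
    def __init__(self, dim=256, n_future=4, n_heads=8, n_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_future, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, history, future_pos):
        # history:    (B, T_hist, C) encoder features of the observed clip
        # future_pos: (B, T_future, C) positional embeddings of target tokens
        q = self.queries.unsqueeze(0).expand(history.size(0), -1, -1) + future_pos
        return self.decoder(q, history)  # (B, T_future, C) predicted features

def token_infonce(pred, target, temperature=0.07):
    """Token-level InfoNCE: each prediction is pulled toward its own target
    feature in the high-level space; other tokens in the batch are negatives."""
    pred = F.normalize(pred.flatten(0, 1), dim=-1)      # (B*T, C)
    target = F.normalize(target.flatten(0, 1), dim=-1)  # (B*T, C)
    logits = pred @ target.t() / temperature            # similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Hypothetical temporal branch: regress the (normalized) timestamps of the
# target tokens from the predicted features.
timestamp_head = nn.Linear(256, 1)

def timestamp_loss(pred_feats, target_times):
    # pred_feats: (B, T_future, C); target_times: (B, T_future) in [0, 1]
    return F.mse_loss(timestamp_head(pred_feats).squeeze(-1), target_times)
```

The key design point this sketch illustrates is that the learnable queries, offset by target positional embeddings, let the model contrast predicted and target features directly in feature space rather than reconstructing raw points.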

Author Biographies

Xiaoxiao Sheng, Shanghai Jiao Tong University

Xiaoxiao Sheng is currently pursuing the Ph.D. degree at Shanghai Jiao Tong University, China. She received the master's degree from the School of Control Science and Engineering, Shandong University, China, in 2020. Her research interests include action recognition and video understanding.

Zhiqiang Shen, Shanghai Jiao Tong University

Zhiqiang Shen is currently pursuing the Ph.D. degree at Shanghai Jiao Tong University, China. He received the master's degree from the School of Control Science and Engineering, Shandong University, China, in 2018. His current research interests include self-supervised representation learning and point cloud understanding.

Longguang Wang, Aviation University of Air Force

Longguang Wang received the B.E. degree in Electrical Engineering from Shandong University (SDU), Jinan, China, in 2015, and the Ph.D. degree in Information and Communication Engineering from National University of Defense Technology (NUDT), Changsha, China, in 2022. His current research interests include low-level vision and 3D vision.

Gang Xiao, Shanghai Jiao Tong University

Gang Xiao received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2005. He is currently a full professor with the School of Aeronautics and Astronautics, Shanghai Jiao Tong University, and director of the Advanced Avionics and Intelligent Information Laboratory. His current research interests include image fusion, target tracking, and avionics integration and simulation. From 2008 to 2016, he published 40 papers and 2 books. He received the title of Shanghai Pujiang Talent in 2016. He is a member of the Information Fusion Branch of the China Aviation Society. He was a Visiting Scholar with Cranfield University, UK (2006), the University of California, San Diego, USA (2010), and Southern Illinois University Edwardsville, USA (2014-2015).

Published

2024-09-29

How to Cite

Sheng, X., Shen, Z., Wang, L., & Xiao, G. (2024). Learnable Query Contrast and Spatio-temporal Prediction on Point Cloud Video Pre-training. IEEE Latin America Transactions, 22(10), 821–828. Retrieved from https://latamt.ieeer9.org/index.php/transactions/article/view/9033