Abstract: Video text extraction plays an important role in multimedia understanding and retrieval. Most previous work operates on individual frames. This paper proposes a hybrid architecture that combines Faster R-CNN (FRCNN) with skeletonization to provide a consistent, end-to-end trainable text identification system. Faster R-CNN is an end-to-end trainable convolutional detector that improves both the speed and the precision of object detection. Whereas the original FRCNN uses the feature map from the final convolutional layer to generate regions of interest (RoIs), the proposed method adds a skeletonization step to detect the specific text in video frames. Experiments on frames drawn from a video dataset show that the hybrid FRCNN and skeletonization method achieves higher detection scores than other recent approaches while maintaining nearly the same detection speed.
Keywords: Text Detection, Skeletonization, Deep Learning, Data Augmentation, Localization, Labelling
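To make the skeletonization step named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which thinning algorithm is used or how it is coupled to the detector, so the sketch assumes Zhang-Suen thinning applied to an already binarized crop. The function names `zhang_suen_skeleton` and `skeleton_in_roi`, and the `box` tuple standing in for a region proposed by Faster R-CNN, are hypothetical and introduced only for illustration.

```python
def zhang_suen_skeleton(mask):
    """Thin a binary mask (list of rows of 0/1) to a roughly one-pixel-wide skeleton.

    Assumption: Zhang-Suen thinning; the paper does not name its algorithm.
    """
    if not mask or not mask[0]:
        return []
    rows, cols = len(mask), len(mask[0])
    # Pad with a zero border so every original pixel has eight neighbours.
    img = ([[0] * (cols + 2)]
           + [[0] + list(row) + [0] for row in mask]
           + [[0] * (cols + 2)])

    def neighbours(r, c):
        # P2..P9, clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, rows + 1):
                for c in range(1, cols + 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(r, c)
                    b = sum(n)  # number of foreground neighbours
                    # Number of 0 -> 1 transitions around the pixel.
                    a = sum(1 for i in range(8)
                            if n[i] == 0 and n[(i + 1) % 8] == 1)
                    p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                    if step == 0:
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((r, c))
            # Clear in parallel, after the whole pass has been evaluated.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    # Strip the padding before returning.
    return [row[1:-1] for row in img[1:-1]]


def skeleton_in_roi(frame_mask, box):
    """Crop a binarized frame to box = (x0, y0, x1, y1) and skeletonize the crop.

    The box is a hypothetical stand-in for a region proposed by the detector.
    The result is expressed in crop coordinates.
    """
    x0, y0, x1, y1 = box
    crop = [row[x0:x1] for row in frame_mask[y0:y1]]
    return zhang_suen_skeleton(crop)


# Toy binarized frame: a 3-pixel-thick horizontal stroke.
frame_mask = [[0] * 14 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 12):
        frame_mask[r][c] = 1

box = (1, 1, 13, 6)  # hypothetical RoI around the stroke
skeleton = skeleton_in_roi(frame_mask, box)
for row in skeleton:
    print("".join("#" if v else "." for v in row))
```

On this toy input the thick stroke is reduced to a thin line along its middle row, which is the kind of stroke-level cue a skeletonization stage can supply in addition to the detector's bounding boxes. In a real pipeline the crop would first need to be binarized, a step the abstract does not describe.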