Deep Multi-modal Complex Event Detection: Fusing the power of events with deep learning

Multi-modal deep CEP
Architecture for deep multi-modal CEP

Deep learning has served object detection in computer vision very well [2]. However, such models struggle to identify complex patterns that span both space and time [2]. Object detection usually relies solely on deep learning to identify objects in images, yet other data sources can improve prediction accuracy and give context to the images captured by surveillance cameras. For instance, to detect that an employee has left their workspace, the camera covering that workspace should no longer detect that person in successive frames. However, the model can easily misreport absence, especially if the frame contains several people. The detection can be supported by readings from weight sensors mounted on the employee’s chair and from motion sensors, e.g. PIR (passive infrared) sensors. In this example, the employee’s absence is a complex event (CE) that can be detected by matching a pattern over one or more raw events; here, the raw events are the objects detected and identified by the camera and the readings from the other sensors. These events must be correlated by both time and location. This correlation logic is defined through pattern rules. Rules can be seen as regular expressions over the raw events, combined with logical conditions on the attributes of those events.
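
To make the notion of a pattern rule concrete, the following is a minimal Python sketch of the employee-absence rule, not an implementation from any CEP framework. The `Event` type, the source names ("camera", "weight", "pir"), the value encoding, and the `min_weight_kg` threshold are all assumptions made for illustration. The rule fires only if every modality reports evidence for the same location within the same time window, and all of that evidence agrees that the person is absent.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """A raw event. All field names and encodings here are hypothetical."""
    source: str       # "camera", "weight", or "pir"
    location: str     # e.g. a desk identifier
    timestamp: float  # seconds
    value: float      # camera: 1 if the person was detected, else 0
                      # weight: chair load in kg; pir: 1 if motion, else 0


def is_absent(events: Iterable[Event], location: str,
              start: float, end: float, min_weight_kg: float = 20.0) -> bool:
    """Return True if the absence pattern matches within [start, end]."""
    # Correlate by location and by time window.
    window = [e for e in events
              if e.location == location and start <= e.timestamp <= end]
    by_source = {s: [e for e in window if e.source == s]
                 for s in ("camera", "weight", "pir")}
    # Require at least one reading from every modality.
    if any(not evs for evs in by_source.values()):
        return False
    # Logical conditions on the attributes of the raw events.
    return (all(e.value == 0 for e in by_source["camera"])
            and all(e.value < min_weight_kg for e in by_source["weight"])
            and all(e.value == 0 for e in by_source["pir"]))
```

A real CEP engine would evaluate such a rule incrementally over unbounded streams; the batch-style function above only illustrates the correlation by time and location and the attribute conditions.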


In this thesis, we want to combine the power of object detection via deep learning models with the expressiveness of pattern detection offered by complex event processing frameworks. The idea is to train simpler models, rather than complex ones, and to express pattern detection using complex event processing.


Supported scenarios include, but are not limited to:

  • Correlation of objects across streams from different cameras: when a person moves from room 1 to room 2, we still need to identify them as the same person.
  • Prediction of user movement direction: this can narrow down the data streams in which to look for a correlation. Given the camera deployment plan and a person's movement, we can predict which camera's monitored area the person is expected to enter next.
  • Abandoned objects in sensitive areas: whose object is that? We can attribute an object to a person based on analysis of past frames, and then trigger an alert when the object is detected away from that person for at least X minutes.
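
As an illustration of the abandoned-object scenario, here is a minimal Python sketch under several assumptions: object ownership is already known, detections carry positions in a shared floor-plan coordinate system, and the `Frame` type, `max_distance`, and `min_duration` (the "X minutes" threshold, in seconds) are hypothetical names introduced for this example. The rule fires once the object has been continuously seen away from its owner for at least `min_duration`.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Frame:
    """Detections at one instant. The structure is hypothetical."""
    timestamp: float                            # seconds
    positions: Dict[str, Tuple[float, float]]   # entity id -> (x, y)


def abandoned_at(frames: Iterable[Frame], obj: str, owner: str,
                 max_distance: float, min_duration: float) -> Optional[float]:
    """Return the timestamp at which the alert fires, or None."""
    separated_since: Optional[float] = None
    for frame in sorted(frames, key=lambda f: f.timestamp):
        if obj not in frame.positions:
            separated_since = None  # object not visible: restart the pattern
            continue
        ox, oy = frame.positions[obj]
        owner_pos = frame.positions.get(owner)
        # The object is "away" if the owner is not visible or is too far.
        apart = (owner_pos is None
                 or hypot(ox - owner_pos[0], oy - owner_pos[1]) > max_distance)
        if not apart:
            separated_since = None  # owner is back: restart the pattern
            continue
        if separated_since is None:
            separated_since = frame.timestamp
        if frame.timestamp - separated_since >= min_duration:
            return frame.timestamp
    return None
```

In this sketch the deep learning model only has to detect and identify the entities in each frame; the temporal condition lives entirely in the rule, which is the division of labour proposed above.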

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[2] T. Xing, M. R. Vilamala, L. Garcia, F. Cerutti, L. Kaplan, A. Preece, and M. Srivastava, “DeepCEP: Deep complex event processing using distributed multimodal information,” in Proceedings of the 2019 IEEE International Conference on Smart Computing (SMARTCOMP), 2019, pp. 87–92.