Activity, Event and Action Databases
1. RPI-ISL Activity Datasets
a. Parking Lot Dataset: (License)
The dataset consists of 108 sequences for 7 actions captured from a parking lot. The actions includes walking, running, leaving car, entering car, bending down, throwing and looking around. These action examples are performed by two people with scale variation, view change and shadow interference.  


# Examples





Leaving Car


Entering Car


Bending Down




Looking Around


*** This dataset is available upon request. Please contact Prof. Qiang Ji at
b. Complex Activity Dataset:
The complex activity dataset consists of 15 video sequences for 5 complex activities in daily life: Shaking hands, Talking, Chasing, Boxing and Wrestling. These activities are conducted by two interactive subjects who perform 5 basic individual actions: Standing, Running, Making a fist, Clinching, and Reaching out. The following table describes the basic actions performed by the two subjects for each complex activity. In each sequence, the 5 complex activities are sequentially performed, so there are 15 examples for each complex activity.


Action of subject 1

Action of Subject 2

Shaking hands

Reaching out

Reaching out








Making a fist

Making a fist




c. VIRAT-like Complex Activity Dataset:
The Virat-like dataset was captured from the 4th floor stairwell of the parking garage overlooking the metered parking lot.  Several actors with three different vehicles performed scripted person-vehicle activities while also capturing background activities of pedestrians.  An HD camcorder with a 1920x1080 resolution was used to capture 3 hours of video for the following actions:

-   Person getting into vehicle

-   Person getting out of vehicle

-   Person loading vehicle

-   Person unloading vehicle

-   Person opening trunk

-   Person closing trunk

-   Person walking

-   Person carrying object

-   Vehicles Parking

-   Vehicle picking person

-   Vehicle dropping off person

-   Person exchanging object from one vehicle to another

-   Person handing an object to another person

-   Group of people gathering

-   Group of people dispersing

-   Group of people moving


-   The original video files are in MTS format; but were converted to mpeg for input to VIPER for annotations

-   The annotated files are in the xgtf (xml) format where only two of the eight video clips have been partially annotated

o  Rpi_dc1_4_7_10_00006.mpeg and _00007.mpeg are partially annotated

o  Only the starting bounding box and the temporal labels of each person-vehicle activity is labeled

o  There are some single person and group activities partially labeled; but that was not the focus at the time.

2. Weizmann Dataset
90 low-resolution(180x144) video clips with 9 different subjects, each of which performs 10 basic actions.
Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, Ronen Basri: Actions as Space-Time Shapes, ICCV2005.
3. KTH Dataset 
6 human actions performed several times by 25 subjects in four different scenarios. It totally contains 2391 sequences.
Christian Schuldt, Ivan Laptev and Barbara Caputo: Recognizing Human Actions: A Local SVM Approach, ICPR2004.
4. UCF Datasets
a. UCF Aerial Action Dataset
This dataset features video sequences that were obtained using a R/C-controlled blimp equipped with an HD camera mounted on a gimbal.The collection represents a diverse pool of actions featured at different heights and aerial viewpoints. Multiple instances of each action were recorded at different flying altitudes which ranged from 400-450 feet and were performed by different actors.
Walking, Running, Digging, Picking up object, Kicking, Opening car door, Closing car door, opening car trunk, closing car trunk.
Mpg for video, xgtf for annotations, and zip for balloon parameters.
Clip Example :
b. UCF ARG dataset
Details: Surveillance dataset. Captured from three views: Aerial, Rooftop and Ground.
Actions: 10 types of actions: Boxing, carrying, clapping, digging, jogging, opening-closing trunk, running, throwing, walking and waving.
c. UCF Sports Action dataset
200 sequences for 9 sports activities at a resolution of 720x480.
This dataset consists of a set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN. The video sequences were obtained from a wide range of stock footage websites including BBC Motion gallery, and GettyImages.
This new dataset contains close to 200 video sequences at a resolution of 720x480. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. By releasing the dataset we hope to encourage further research into this class of action recognition in unconstrained environments.
Diving (16 videos)
Golf swinging (25 videos)
Kicking (25 videos)
Lifting (15 videos)
Horseback riding (14 videos)
Running (15 videos)
Skating (15 videos)
Swinging (35 videos)
Walking (22 videos)
If you use this dataset, please refer to paper:
Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah, Action MACH: A Spatio-temporal Maximum Average Correlation Height Filter for Action Recognition.
d. UCF11 (Youtube Action) dataset
11 action categories from the Youtube videos. Each of the actions is grouped into 25 groups with more than 4 action clips in it. All actions are annotated using the VIPER format.
e. UCF50 & UCF101 datasets
Both datasets are extentions of UCF11 (Youtube Action) dataset. The UCF50 dataset is an action dataset with 50 action categories from realistic videos taken from Youtube. The UCF101 dataset further extends the UCF50 dataset to 101 action categoreis from realistic videos taken from Youtube.
For the CAVIAR project a number of video clips were recorded acting out the different scenarios of interest. These include people walking alone, meeting with others, window shopping, entering and exitting shops, fighting and passing out and last, but not least, leaving a package in a public place.
The first section of video clips were filmed for the CAVIAR project with a wide angle camera lens in the entrance lobby of the INRIA Labs at Grenoble, France. The resolution is half-resolution PAL standard (384 x 288 pixels, 25 frames per second) and compressed using MPEG2. The file sizes are mostly between 6 and 12 MB, a few up to 21 MB.
A typical frame from the image sequences is below. It shows three individual boxes (yellow) and one group box (green). There are several people in the video sequence that are not boxed because they do not move over the course of the sequence.
Walking along, meeting others, window shopping, entering shops, fighting, passing out, leaving package.
MPEG2 or separate jpegs.
The ground truth for these sequences was found by hand-labeling the images, as in the example shown above. The JAVA programs for the interactive labeller can be found here (unsupported). This is the userguide.
If you publish results using the data, please acknowledge the data as coming from the EC Funded CAVIAR project/IST 2001 37540, found at URL:
Clip Example :
Six basic scenarios acted out by the CAVIAR team members: Walking, Browsing, Resting, Leaving bags behind, People meeting/walking together/splitting up and Two people fitting. There are about 3 to 6 clips for each scenario.
6. HOHA Dataset 
a. Hollywood Human Actions dataset
Contains 8 classes of human actions from 32 Hollywood movies: AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp.
b. Hollywood-2 Human Actions and Scenes dataset
Extended from the HOHA dataset. It contains 12 classes of human actions and 10 classes of scenes. There are over 3669 video sequences and approximately 20.1 hours of video in total.
Ivan Laptev, Marcin Marszalek, Cordelia Schmid and Benjamin Rozenfeld: Learning Realistic Human Actions from Movies, CVPR2008.
7. UIUC Badmindton Activity Dataset
The UIUC Badmindton Activity Dataset contains 14 types of human activities of the badminton match videos. The 14 types of activities are: walking, runnning, jumping, waving, jumging jacks, clapping, jump from situp, raise 1 hand, stretching out, turning, sitting to standing, crawling, pushing up, and standing to sitting.
Video Examples:

Du Tran, Alexander Sorokin, Human Activity Recognition with Metric Learning, ECCV08, France.
8. UMN Dataset
An unusual crowd activity dataset consisting of 11 different scenarios of escape events in 3 different scenes.
9. CANTATA Project
Details:  This site contains details and links to all of the PETs databases as well as "Behave" and "Traffic" databases
Actions: pedestrian and vehicle activities; but focus is on tracking
Files: Some have annotations
Acknowledgements: See individual datasets
Clip Example:
10. Traffic Intersection Video
Details:  This site contains a collection of video clips of traffic intersections in Berthold-StraBe in Karlsruhe captured with stationary cameras in grayscale and some in color
Actions: typical vehicle activities
Files: MPEG videos of different qualities and zip files
Acknowledgements: See individual datasets
Clip Example: Screen shot of some of the video clips
11. Videoweb Activities Dataset
In order to obtain the Videoweb dataset, you must go to its website:, and follow its protocol (requires you to sign a release form)
Details:  The Videoweb dataset consists of about 2.5 hours of video observed from 4-8 cameras. The data is divided into a number of scenes that were collected over many days. Each scene is observed by a camera network where the actual number of cameras changes by scene due the nature of the scene. For each scene, the videos from the cameras are available. Annotation is available for each scene and the annotation convention is described in the dataset. It identifies the frame numbers and camera ID for each activity that is annotated. The videos from the cameras are approximately synchronized. The videos contain several types of activities including throwing a ball, shaking hands, standing in a line, handing out forms, running, limping, getting into/out of a car, and cars making turns. The number for each activity varies widely.
1. High-Level Interaction Recognition: hand shaking, hugging, kicking, pointing, punching, and pushing
2. Aerial view activity classification: pointing, standing, digging, walking, carrying, running, wave1, wave2, and jumping
3. Wide area search and recognition
Files: MPEG videos of with annotations
If you make use of the Videoweb dataset in any form, please cite the following reference.
  author = "C. Ding and A. Kamal and G. Denina and H. Nguyen and A. Ivers and B. Varda and C. Ravishankar and B. Bhanu and A. Roy-Chowdhury",
  title = "{V}ideoweb {A}ctivities {D}ataset, {ICPR} contest on {S}emantic {D}escription of {H}uman {A}ctivities ({SDHA})",
  year = "2010",
  howpublished = "\_Area\_Activity.html"
Clip Example:
12. UTexas Datasets
a. UT-Interaction Dataset:
Activities: The UT-Interaction dataset contains videos of continuous executions of 6 classes of human-human interactions: shake-hands, point, hug, push, kick and punch. Ground truth labels for these interactions are provided, including time intervals and bounding boxes.
Details: There is a total of 20 video sequences whose lengths are around 1 minute. Each video contains at least one execution per interaction, providing us 8 executions of human activities per video on average. Several participants with more than 15 different clothing conditions appear in the videos. The videos are taken with the resolution of 720*480, 30fps, and the height of a person in the video is about 200 pixels.
author = "Ryoo, M. S. and Aggarwal, J. K.",
title = "{UT}-{I}nteraction {D}ataset, {ICPR} contest on {S}emantic {D}escription of {H}uman {A}ctivities ({SDHA})",
year = "2010",
howpublished = "\_Interaction.html"
b. UT-Tower Dataset:
Activities: The UT-Tower dataset consists of 108 low-resolution video sequences from 9 types of actions. Each action is performed 12 times by 6 individuals. The dataset is composed of two types of scenes: concrete square and lawn.
Details: Ground truth labels for all actions videos are provided for the training and the testing. In addition, in order to alleviate segmentation and tracking issues and make participants focus on the classification problem, the ground truth bounding boxes as well as foreground masks for each video are provided. Only the acting person is included in the bounding box.
author = "Chia-Chih Chen and M. S. Ryoo and J. K. Aggarwal",
title = "{UT}-{T}ower {D}ataset: {A}erial {V}iew {A}ctivity {C}lassification {C}hallenge",
year = "2010",
howpublished = "\_View\_Activity.html"
13. Olympic Sports Dataset
About Dataset : The Olympic Sports Dataset contains videos of athlets practicing different sports. All videos sequences of this dataset are from Youtube dataset.
Sports: the 16 types of sports to recognize are hign-jump, long-jump, triple-jump, pole-vault, basketball lay-up, bowling, tennis-serve, platform diving, discus throw, hammer throw, javelin throw, shot put, springboard diving, snatch (weightlifting), clean and jerk (weightlifting), and gymnastic vault.
Juan Carlos Niebles, Chih-Wei Chen and Li Fei-Fei. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. 11th European Conference on Computer Vision (ECCV) , 2010.
14. TV Human Interaction Dataset
The TV Human Interaction Dataset consists of 300 video clips collected from over 20 different TV shows and containing 4 interactions: hand shakes, high fives, hugs and kisses, as well as clips that don't contain any of the interactions.
Annotations: provided with 1) annotations of the upper body of people with a bounding box; 2) discrete head orientation (profile-left, profile-right, frontal-left, frontal-right and backwards); 3) interaction label of each person.
Patron-Perez, A., Marszalek, M., Zisserman, A. and Reid, I., High Five: Recognising human interactions in TV shows, Proceedings of the British Machine Vision Conference (BMVC), Aberystwyth, UK, 2010.
15. HMDB 51 Dataset
About Dataset: HMDB are collected from various sources, mostly from movies, and a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6849 clips divided into 51 action categories, each containing a minimum of 101 clips.
H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A Large Video Database for Human Motion Recognition. ICCV, 2011.
16. OSUPEL Basketball Dataset
The OSUPEL basketball dataset contains videos of actual basketball games and drills for primitive events and complex activity recognition. The dataset was captured with 960*540, 29fps resolution handheld camera set on a tripod.
Player Primitive Events:
Passing, Catching, Holding Ball, Shooting, Jumping, Dribbling.
AVI videos with annotations of event label, event starting/ending time, and player/object bounding boxes.
If you make use of the Videoweb dataset in any form, please refer to the following work:
William Brendel, Alan Fern, and Sinisa Todorovic, Probabilistic Event Logic for Interval-Based Event Recognition, CVPR 2011.
Clip Example :
17. VIRAT Video Dataset
VIRAT Video Dataset contains two broad categories of activties (single-object and two-objects) which involve both human and vehicles. Details of included activities, and annotation formats may differ per release. Please refer to the official website of VIRAT Video Dataset for details.
If you make use of the VIRAT Video Dataset, please use the following citation:
"A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video" by Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, J.K. Aggarwal, Hyungtae Lee, Larry Davis, Eran Swears, Xiaoyang Wang, Qiang Ji, Kishore Reddy, Mubarak Shah, Carl Vondrick, Hamed Pirsiavash, Deva Ramanan, Jenny Yuen, Antonio Torralba, Bi Song, Anesco Fong, Amit Roy-Chowdhury, and Mita Desai, in Proceedings of IEEE Comptuer Vision and Pattern Recognition (CVPR), 2011.
18. MPII Cooking Activities Dataset
MPII Cooking Activities Dataset records 12 participants performing 65 different cooking activities, such as "cut slices", "pour", or "spice". 44 videos with a total length of more than 8 hours are included.
Raw data for videos as AVIs, images as JPGs. Video features including dense trajectory features, HoG, HoF, MBH and pose features. Ground truth annotations.
When using MPII Cooking Activities dataset, please site the following paper:
M. Rohrbach, S. Amin, M. Andriluka and B. Schiele, A Database for Fine Grained Activity Detection of Cooking Activities, CVPR2012.
19. GeorgiaTech Egocentric Activities (GTEA)
About Dataset: This dataset contains 7 types of daily activities, each performed by 4 different subjects. The camera is mounted on a cap worn by the subject.
Activities: Cheese Sandwich, Coffee, Coffee with Honey, Hotdog Sandwich, Peanut-butter Sandwich, Peanut-butter and Jam Sandwich, Sweet Tea.
Alireza Fathi, Xiaofeng Ren, James M. Rehg, Learning to Recognize Objects in Egocentric Activities, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
20. Activities of Daily Living (ADL) Dataset
ADL is a dataset for detecting activities of daily living in first-person camera views. It includes 18 actions performed by 20 people.
ADL Videos, object and action annotations, results for running part-based object detectors, and training and test code for detecting ADL using object centeric features.
Please refer to the following paper:
Hamed Pirsiavash and Deva Ramanan, Detecting Activities of Daily Living in First-person Camera Views, CVPR2012.
21. JPL First-Person Interaction Dataset
About Dataset: JPL First-Person Interaction dataset (JPL-Interaction dataset) is composed of human activity videos taken from a first-person viewpoint. The dataset particularly aims to provide first-person videos of interaction-level activities, recording how things visually look from the perspective (i.e., viewpoint) of a person/robot participating in such physical interactions.
Activities: hand shake, hug, pet, wave, point-converse, punch and throw.
M. S. Ryoo and L. Matthies, "First-Person Activity Recognition: What Are They Doing to Me?", CVPR 2013.
22. NIST TRECVID Multimedia Event Detection Track
The NIST TRECVID Multimedia Event Detection 2011 (MED11) dataset consists of a collection of Internet videos collected by the Linguistic Data Consortium from various Internet video hosting sites. It includes 1400 hours of videos with 45000 clips. There are 15 complex events.
Complex Events in MED11:
Attempting a Board, Feeding an Animal, Landing a Fish, Wedding Ceremony, Wroking on a Woodworking project, Birthday Party, Changing a Vehicle Tire, Flash mob Gathering, Getting a Vehicle Unstuck, Grooming an Animal, Making a Sandwich, Parade, Parkour, Preparing an Appliance, Working on a Sewing Project.
NIST keeps updating this dataset with MED12 and MED13 releases.