
Facial data capture

The Facial Capture extension collects quantitative data such as facial feature points, head rotation, and head translation. You can use this data to drive 3D facial stickers, headgear, and pendant effects, or to animate digital humans, giving virtual avatars more vivid expressions.

This guide is intended for use cases where the Facial Capture extension is used independently to capture facial data, while a third-party rendering engine animates a virtual human.

Information

To avoid collecting facial data through callbacks and building your own collection, encoding, and transmission framework, use the Agora MetaKit extension for facial capture.

Prerequisites

Ensure that you have:

  • Integrated Video SDK version 4.3.0, including the face capture extension dynamic library AgoraFaceCaptureExtension.xcframework.
  • Obtained the face capture authentication parameters authentication_information and company_id by contacting Agora technical support.

Implement the logic

This section shows you how to integrate the Facial Capture extension in your app to capture facial data.

Enable the extension

To enable the Facial Capture extension, call enableExtension:

agoraKit.enableExtension(
    withVendor: "agora_video_filters_face_capture",
    extension: "face_capture",
    enabled: true,
    sourceType: AgoraMediaSourceType.primaryCamera)
Info

When you enable the Facial Capture extension for the first time, a delay may occur. To ensure smooth operation during a session, call enableExtension before joining a channel.
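A minimal sketch of this call order follows. The app ID, token, and channel name are placeholders; the point is that enableExtension runs before joining, so the one-time extension loading delay happens outside the session.

import AgoraRtcKit

// Create the engine; "YourAppID" is a placeholder for your Agora App ID.
let agoraKit = AgoraRtcEngineKit.sharedEngine(withAppId: "YourAppID", delegate: nil)

// Enable the Facial Capture extension first, so its loading delay
// does not occur mid-session.
agoraKit.enableExtension(
    withVendor: "agora_video_filters_face_capture",
    extension: "face_capture",
    enabled: true,
    sourceType: AgoraMediaSourceType.primaryCamera)

// Join the channel only after the extension is enabled.
// "YourToken" and "demo" are placeholders.
agoraKit.joinChannel(
    byToken: "YourToken",
    channelId: "demo",
    info: nil,
    uid: 0,
    joinSuccess: nil)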

Set authentication parameters

To ensure that the extension functions properly, call setExtensionProperty to pass the necessary authentication parameters.

agoraKit.setExtensionProperty(
    withVendor: "agora_video_filters_face_capture",
    extension: "face_capture",
    key: "authentication_information",
    value: "{\"company_id\":\"xxxxx\", \"license\":\"xxxxx\"}",
    sourceType: AgoraMediaSourceType.primaryCamera)
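If you prefer to build the value string programmatically, the following sketch constructs it with JSONSerialization, which avoids hand-escaping the JSON. The credential values are placeholders for the parameters you obtained from Agora technical support.

import Foundation

// Placeholder credentials obtained from Agora technical support.
let credentials: [String: String] = [
    "company_id": "xxxxx",
    "license": "xxxxx"
]

// Serialize the dictionary to a JSON string and pass it to the extension.
if let data = try? JSONSerialization.data(withJSONObject: credentials),
   let authInfo = String(data: data, encoding: .utf8) {
    agoraKit.setExtensionProperty(
        withVendor: "agora_video_filters_face_capture",
        extension: "face_capture",
        key: "authentication_information",
        value: authInfo,
        sourceType: AgoraMediaSourceType.primaryCamera)
}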

Retrieve facial data

To obtain raw video data containing facial information, implement the onCapture callback of AgoraVideoFrameDelegate:

func onCapture(_ videoFrame: AgoraOutputVideoFrame, sourceType: AgoraVideoSourceType) -> Bool {
    // Facial data, when present, is attached to the frame's metaInfo
    // dictionary under the "KEY_FACE_CAPTURE" key.
    if let faceInfo = videoFrame.metaInfo["KEY_FACE_CAPTURE"] as? String {
        print("Face Info: \(faceInfo)")
    }
    return true
}
Important

Currently, the facial capture function outputs data for only one face at a time. When the callback is triggered, copy the facial data into separately allocated memory and process it on a separate thread; otherwise, blocking the raw data callback may cause frame loss.
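The following sketch shows one way to follow this guidance: the facial data string is copied out of the callback and handed to a dedicated queue, so the callback returns immediately. processFaceInfo is a hypothetical function standing in for your own processing pipeline.

import AgoraRtcKit
import Foundation

// A dedicated serial queue for facial data processing.
let faceCaptureQueue = DispatchQueue(label: "com.example.face-capture")

func onCapture(_ videoFrame: AgoraOutputVideoFrame, sourceType: AgoraVideoSourceType) -> Bool {
    if let faceInfo = videoFrame.metaInfo["KEY_FACE_CAPTURE"] as? String {
        // String is a value type; capturing it in the closure copies the data
        // out of the callback, which then returns without blocking.
        faceCaptureQueue.async {
            processFaceInfo(faceInfo)
        }
    }
    return true
}

// Hypothetical downstream processing, for example parsing the JSON and
// forwarding blendshape coefficients to your rendering engine.
func processFaceInfo(_ faceInfo: String) {
    // ...
}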

Use facial information to drive virtual humans

The output facial data is in JSON format and includes quantitative information such as facial feature points, head rotation, and head translation. This data follows the Blend Shape (BS) format in compliance with the ARKit standard. You can use a third-party 3D rendering engine to further process the BS data. The key elements are:

  • faces: An array of objects, each representing the recognized facial information for one face.
    • detected: A float representing the confidence level of face recognition, ranging from 0.0 to 1.0.
    • blendshapes: An object containing the face capture coefficients. The keys follow the ARKit standard, with each key-value pair representing a blendshape coefficient whose value is a float between 0.0 and 1.0.
    • rotation: An object representing head rotation, containing three key-value pairs. All values are floats between -180.0 and 180.0.
      • pitch: The pitch angle of the head. Positive values represent lowering the head; negative values represent raising it.
      • yaw: The rotation angle of the head. Positive values represent rotation to the left; negative values represent rotation to the right.
      • roll: The tilt angle of the head. Positive values represent tilting to the right; negative values represent tilting to the left.
    • translation: An object representing head translation, with three key-value pairs: x, y, and z. The values are floats between 0.0 and 1.0.
    • faceState: An integer indicating the current face capture control state:
      • 0: The algorithm is in face capture control.
      • 1: The algorithm control is returning to the center.
      • 2: The algorithm is reset and not in control.
  • timestamp: A string representing the timestamp of the output result, in milliseconds.

This data can be used to animate virtual humans by applying the blendshape coefficients and head movement data to a 3D model. The following is an example of the facial data output:

{
  "faces": [{
    "detected": 0.98,
    "blendshapes": {
      "eyeBlinkLeft": 0.9, "eyeLookDownLeft": 0.0, "eyeLookInLeft": 0.0, "eyeLookOutLeft": 0.0, "eyeLookUpLeft": 0.0,
      "eyeSquintLeft": 0.0, "eyeWideLeft": 0.0, "eyeBlinkRight": 0.0, "eyeLookDownRight": 0.0, "eyeLookInRight": 0.0,
      "eyeLookOutRight": 0.0, "eyeLookUpRight": 0.0, "eyeSquintRight": 0.0, "eyeWideRight": 0.0, "jawForward": 0.0,
      "jawLeft": 0.0, "jawRight": 0.0, "jawOpen": 0.0, "mouthClose": 0.0, "mouthFunnel": 0.0, "mouthPucker": 0.0,
      "mouthLeft": 0.0, "mouthRight": 0.0, "mouthSmileLeft": 0.0, "mouthSmileRight": 0.0, "mouthFrownLeft": 0.0,
      "mouthFrownRight": 0.0, "mouthDimpleLeft": 0.0, "mouthDimpleRight": 0.0, "mouthStretchLeft": 0.0, "mouthStretchRight": 0.0,
      "mouthRollLower": 0.0, "mouthRollUpper": 0.0, "mouthShrugLower": 0.0, "mouthShrugUpper": 0.0, "mouthPressLeft": 0.0,
      "mouthPressRight": 0.0, "mouthLowerDownLeft": 0.0, "mouthLowerDownRight": 0.0, "mouthUpperUpLeft": 0.0, "mouthUpperUpRight": 0.0,
      "browDownLeft": 0.0, "browDownRight": 0.0, "browInnerUp": 0.0, "browOuterUpLeft": 0.0, "browOuterUpRight": 0.0,
      "cheekPuff": 0.0, "cheekSquintLeft": 0.0, "cheekSquintRight": 0.0, "noseSneerLeft": 0.0, "noseSneerRight": 0.0,
      "tongueOut": 0.0
    },
    "rotation": {"pitch": 30.0, "yaw": 25.5, "roll": -15.5},
    "rotationMatrix": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "translation": {"x": 0.5, "y": 0.3, "z": 0.5},
    "faceState": 1
  }],
  "timestamp": "654879876546"
}
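As a sketch of how a receiving app might consume this payload, the following Codable types decode the JSON shown above. The type and function names are illustrative, not part of the Agora SDK; extra keys such as rotationMatrix are ignored by the decoder unless you model them.

import Foundation

// Illustrative types matching the JSON layout shown above.
struct FaceCaptureResult: Codable {
    let faces: [Face]
    let timestamp: String
}

struct Face: Codable {
    let detected: Float
    let blendshapes: [String: Float]
    let rotation: Rotation
    let translation: Translation
    let faceState: Int
}

struct Rotation: Codable {
    let pitch: Float
    let yaw: Float
    let roll: Float
}

struct Translation: Codable {
    let x: Float
    let y: Float
    let z: Float
}

// Decode the string received in the onCapture callback.
func parseFaceInfo(_ json: String) -> FaceCaptureResult? {
    guard let data = json.data(using: .utf8) else { return nil }
    return try? JSONDecoder().decode(FaceCaptureResult.self, from: data)
}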

Disable the extension

To disable the Facial Capture extension, call enableExtension:

agoraKit.enableExtension(
    withVendor: "agora_video_filters_face_capture",
    extension: "face_capture",
    enabled: false,
    sourceType: AgoraMediaSourceType.primaryCamera)

Reference

This section contains content that completes the information on this page, or points you to documentation that explains other aspects of this product.

Sample project

Agora provides open source sample projects on GitHub for your reference. Download or view the Face Capture sample project for a more detailed example.

API reference

  • enableExtension
  • setExtensionProperty
  • onCapture