
Hand tracking on iOS devices


On the Vision Pro, we already see the magic of 3D hand joints being tracked in the same 3D space as the camera/viewer. This comes by default on spatially aware hardware.

However, how do we get a similar feature on an iOS device? As a technology enthusiast, I thought it became possible when Apple released the Vision framework's hand tracking feature in 2020. In the Vision framework, hand joint positions are given in 2D image space. So the question is how to recover the 3D data and mix it with an Augmented Reality (AR) session.

Now let's first see the result:

To achieve this, here are the high-level steps (rough code sketches for the steps follow the list):

  1. You need a defined 3D space. We use ARKit to construct a 3D space aligned to the world, so the camera and virtual objects all live in this same space.
  2. Set up the Vision framework with the ARFrame's RGB video to track the hand; this gives us hand joints in image space (see the Vision setup sketch after this list).
  3. Use ARKit's depth data and provide a method to sample the depth value at a 2D location. You need the correct mapping between the RGB image and the depth image so you sample the correct depth (depth sampling sketch below).
  4. With ARFrame.camera, you can also get the camera's intrinsic and extrinsic info.
  5. Then you are facing a classic graphics problem: given the 2D image-space coordinates of a point, like one rendered by a graphics pipeline, how do you convert it back to a 3D location in world space? In this case, you first obtain the image-space (x, y) plus depth from the steps above, multiply by the inverse of the intrinsics matrix (or convert to NDC space [-1, 1] and use the inverse projection matrix), then multiply by the inverse of the camera's rotation and translation, i.e. the view matrix, to land in world space (unprojection sketch below).
  6. After all of these steps, you can bring any reliable depth point into the same world space that ARKit creates and compute its world-space position. Using multiple points on the hand, you can even estimate the hand's rotation and apply it to a virtual object (orientation sketch below).
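
To make steps 1 and 2 concrete, here is a minimal sketch of the session and Vision setup, assuming a LiDAR-equipped device so ARKit can supply sceneDepth; the HandTracker class name and the joint/confidence choices are illustrative, not a definitive implementation:

```swift
import ARKit
import Vision

final class HandTracker: NSObject, ARSessionDelegate {
    let session = ARSession()
    private let handPoseRequest = VNDetectHumanHandPoseRequest()

    func start() {
        let configuration = ARWorldTrackingConfiguration()
        // Scene depth needs a LiDAR device; skip it gracefully otherwise.
        if ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) {
            configuration.frameSemantics.insert(.sceneDepth)
        }
        handPoseRequest.maximumHandCount = 1
        session.delegate = self
        session.run(configuration)
    }

    // ARKit delivers camera frames here; feed the RGB image to Vision.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // .up keeps Vision's coordinates aligned with the raw sensor image
        // (and thus the depth map); rotate for your UI orientation if needed.
        let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage,
                                            orientation: .up,
                                            options: [:])
        do {
            try handler.perform([handPoseRequest])
            guard let hand = handPoseRequest.results?.first else { return }
            // Joint locations are normalized [0, 1] with a lower-left origin.
            let indexTip = try hand.recognizedPoint(.indexTip)
            guard indexTip.confidence > 0.3 else { return }
            // Next: sample depth at this point and unproject (see below).
            print("index tip (normalized):", indexTip.location)
        } catch {
            print("Hand pose request failed: \(error)")
        }
    }
}
```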
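
For step 3, depth sampling could look like the sketch below. It assumes the point comes from Vision with a lower-left origin (hence the y flip) and relies on sceneDepth being aligned with capturedImage, so the same normalized coordinate works for both buffers even though the depth map is lower resolution:

```swift
import ARKit

// Sample the LiDAR depth (in meters) at a normalized image coordinate.
func sampleDepth(at normalizedPoint: CGPoint, in frame: ARFrame) -> Float? {
    guard let depthMap = frame.sceneDepth?.depthMap else { return nil }

    CVPixelBufferLockBaseAddress(depthMap, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(depthMap, .readOnly) }

    let width = CVPixelBufferGetWidth(depthMap)
    let height = CVPixelBufferGetHeight(depthMap)
    // Flip y: Vision uses a lower-left origin, pixel rows start at the top.
    let x = min(max(Int(normalizedPoint.x * CGFloat(width)), 0), width - 1)
    let y = min(max(Int((1 - normalizedPoint.y) * CGFloat(height)), 0), height - 1)

    guard let base = CVPixelBufferGetBaseAddress(depthMap) else { return nil }
    let bytesPerRow = CVPixelBufferGetBytesPerRow(depthMap)
    // The depth map is single-channel 32-bit float, values in meters.
    let row = base.advanced(by: y * bytesPerRow)
    let depth = row.assumingMemoryBound(to: Float32.self)[x]
    return depth.isFinite ? depth : nil
}
```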
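
For steps 4 and 5, one way to unproject a pixel plus its depth using the camera intrinsics and the camera transform is sketched below; the sign flips reflect ARKit's camera convention (x right, y up, looking down -z) versus the pinhole intrinsics convention (y down, z forward), and are an assumption about how the rest of your pipeline is set up:

```swift
import ARKit
import simd

// Convert a pixel coordinate in the captured image plus a depth value (meters)
// into a world-space position, following the inverse-intrinsics / inverse-view
// idea from step 5.
func unproject(pixel: CGPoint, depth: Float, camera: ARCamera) -> SIMD3<Float> {
    let K = camera.intrinsics               // 3x3 intrinsics (fx, fy, cx, cy)
    let fx = K[0][0], fy = K[1][1]
    let cx = K[2][0], cy = K[2][1]

    // Back-project through the inverse intrinsics into camera space.
    let xCam = (Float(pixel.x) - cx) / fx * depth
    let yCam = (Float(pixel.y) - cy) / fy * depth
    // Flip into ARKit's camera coordinates (y up, camera looks along -z).
    let localPoint = SIMD4<Float>(xCam, -yCam, -depth, 1)

    // camera.transform is camera-to-world, i.e. the inverse of the view matrix.
    let worldPoint = camera.transform * localPoint
    return SIMD3<Float>(worldPoint.x, worldPoint.y, worldPoint.z)
}
```

The pixel coordinate here is in the capturedImage's resolution, so a normalized Vision point needs to be scaled by camera.imageResolution (and flipped vertically) before calling this.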
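
For step 6, once a few joints have world-space positions, a rough hand orientation can be built from an orthonormal basis. The specific joints (wrist, index MCP, little MCP) and the axis naming here are just one illustrative choice; any three non-collinear joints work:

```swift
import simd

// Build a rough hand orientation from three world-space joints.
func handOrientation(wrist: SIMD3<Float>,
                     indexMCP: SIMD3<Float>,
                     littleMCP: SIMD3<Float>) -> simd_quatf {
    let forward = simd_normalize(indexMCP - wrist)        // roughly along the fingers
    let across = littleMCP - wrist                        // across the palm
    let up = simd_normalize(simd_cross(forward, across))  // palm normal
    let right = simd_normalize(simd_cross(up, forward))
    return simd_quatf(simd_float3x3(columns: (right, up, forward)))
}
```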

For better occlusion handling, we can extract the hand-related depth data and draw it into the frame buffer, letting the graphics pipeline handle the depth test.
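
One SceneKit-flavored way to express the "draw the hand into the frame buffer and let the depth test do the work" idea is to give a proxy hand geometry a material that writes depth but no color. Building that proxy geometry from the sampled hand depth is the hard part and is not shown here; this is only a sketch of the depth-only material setup:

```swift
import SceneKit

// Make a node that is invisible but still occludes virtual content behind it.
func makeDepthOnlyHandNode(handGeometry: SCNGeometry) -> SCNNode {
    let material = SCNMaterial()
    material.colorBufferWriteMask = []   // write no color, only depth
    material.writesToDepthBuffer = true
    handGeometry.materials = [material]

    let node = SCNNode(geometry: handGeometry)
    node.renderingOrder = -1             // render before the virtual content
    return node
}
```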

For smoother occlusion or physical simulation, fitting and tracking a morphable hand model against the 3D joint positions would be a better option.