CVPR 2019 Day 1

What an early flight to Long Beach! I woke up at 3:30 am and was glad to find a Lyft driver available in the middle of the night. I have to say Lyft/Uber makes life easier. Just a reminder, though, that SJC doesn’t open check-in until 4:00 am… so don’t rush to get there too early.

So Sunday and Monday are for the workshops. In the morning I went to the 3D Scene Understanding workshop and listened to a good talk on “What Do Single-view 3D Reconstruction Networks Learn?” It points out that the current state-of-the-art single-image reconstruction work is, to a large extent, just doing image retrieval. This is because the shape similarity metrics in use are not good enough, and the training set is contaminated with models that already look very similar to the ones in the test set. Also, using a fixed canonical view as the single-image input fits the 2D image case but is not really the best choice for 3D meshes. The talk really clears up some issues in 3D reconstruction research, and I think the paper is worth reading. You can find the paper here, and here is the YouTube video of the talk.

On the same day, however, Facebook AI also presented Mesh R-CNN, which basically reconstructs a mesh from a single image, much like their Mask R-CNN creates a 2D mask from a single image. It would be interesting to check that paper and see whether it runs into any of the issues pointed out by the work above.

In the afternoon my colleague led me to the ScanNet benchmark challenge workshop. Professor Matthias Nießner is really active in facial/body reconstruction work, and his work now also extends to general scene 3D capture and registration. ScanNet aims to create a dataset with vertex-level labels plus 3D bounding boxes, like a 3D version of ImageNet. The workshop is basically a showcase of everyone who participated in the detection task on the ScanNet dataset. Stanford’s work achieved very good results by taking advantage of temporal coherence. It is a very interesting idea that fundamentally improves the data representation and training procedure. Very nice results.

Later in the afternoon I went back to my original research domain to catch the survey-style talk on state-of-the-art human body/facial capture given by Michael J. Black. I do feel there is great potential here; I need to investigate when I have some spare time. And here is the video for that.

I think on Monday I will be in the AR/VR session. I hope to learn more in this area, or at least see which parts people have not covered yet…

SIGGRAPH 2018 Day 4

Today was a little more casual. In the morning, I visited Nvidia’s ray tracing/path tracing session. They emphasized that, much like the first GPU in 1998, RTX is a new thing everyone should try to catch up with.

Then I also went to the 3D capture session. The papers there were all very interesting. I think we are at an important stage in this area right now.

In the afternoon, I went to the material capture session. It was nice to see how a deep learning model can be trained with a differentiable renderer to generate materials from a single image. I do need to look into this work.
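To make the idea concrete, here is a minimal sketch of training through a differentiable renderer — my own toy example under strong assumptions, not that paper’s method. A tiny network predicts an albedo map from a photo, a toy Lambertian shader re-renders it, and the photometric loss is backpropagated through the shader into the network; all names, shapes, and the “renderer” itself are illustrative stand-ins.

```python
# Hypothetical sketch: learn materials by re-rendering and comparing to the photo.
import torch
import torch.nn as nn

class MaterialNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny CNN mapping a 3x64x64 image to a per-pixel albedo map
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, img):
        return self.net(img)  # predicted albedo, same resolution as the input

def toy_render(albedo, light_dir):
    # Differentiable Lambertian shading with a flat normal (0, 0, 1);
    # real work would use a full SVBRDF and a proper differentiable renderer.
    n_dot_l = torch.clamp(light_dir[2], min=0.0)
    return albedo * n_dot_l

model = MaterialNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
photo = torch.rand(1, 3, 64, 64)           # captured photo (random stand-in)
light = torch.tensor([0.0, 0.0, 1.0])      # assumed known light direction

opt.zero_grad()
albedo = model(photo)
loss = nn.functional.mse_loss(toy_render(albedo, light), photo)
loss.backward()                            # gradients flow through the renderer
opt.step()
```

The point of the sketch is just that the rendering step is differentiable, so the image-space loss can supervise the material prediction without ground-truth material maps.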

SIGGRAPH 2018 Day 3

Today was a bit of a time-and-space-juggling adventure. I tried to get into the talks for two state-of-the-art face-related papers in the morning VR session. One is from TUM and the other is from Facebook Reality Labs. Both try to tackle the problem of showing a genuine, full-face expression in VR while both parties are wearing headsets.

On one side, Matthias Niessner and his golden face-synthesis team explore how to deal with this issue based on their Face2Face work. The advantage is that, since they use a generic face model, the representation is not strongly subject-dependent, so no calibration or pre-capture is necessary. However, since they only use the infrared camera inside the headset for eye gaze tracking, the upper face’s expression may not be well preserved.

Facebook, on the other hand, uses a subject-dependent, high-quality model for this work, with deep learning for teeth compositing. The quality looks better, but it needs a pre-capture session for each subject.

And thanks to my friends from Pixar: this time we noticed that the animation studio had no booth, so we didn’t know where to pick up the RenderMan teapot. It turns out they hand them out after their RenderMan 22 demo talk, which lasts an hour. It is actually a really good talk: 30 years of RenderMan development, from a scanline renderer to ray tracing, and then path tracing. They gave up old infrastructure in favor of physically correct and simple models. It is great to see that, at this stage, ray-traced lighting can be achieved at interactive speed. With the help of Nvidia’s RTX, I think production time for all stages of animation can shrink, and we could see more ideas in movies, since the cost of trying out new story lines, cameras, actions, etc. is lower. But the most important thing: I got my teapot!

The Real-Time Live! demo session was also crazy. The combined VR virtual movie shoot demo from Nvidia RTX, ILMxLAB, and Unreal is a total game changer for how we can make movie-quality shots in real time with everyone inside a virtual environment. I can imagine that in the near future, individual shots may be captured in this real-time ray tracing environment. Then the director can cut the movie for review and hand the shot to the offline renderer, if necessary, for the final frames.

SIGGRAPH 2018 Day 2

So today’s major coverage is two talks: one from Rob Bredow, VP of ILM, and the other from the CEO of NVIDIA.

Rob’s talk was about the power of the creative process, in which he talked about his experience as a first-time VFX producer on the Star Wars movie Solo.

He mentioned that people go through three different stages during the creative process:

  • Just starting: when you want to get into the field.

At the beginning, people should study and try to build things on top of others’ work, more like an interdisciplinary study. It is easier to create something based on existing material.

  • Know the theme: when you already know the tools and are actively working in the field.

  • Lead: how to lead the creative process.

During this stage, people first need to define the theme, which is the concept you are trying to follow, and make sure to stay on that path before diving into the details. He used the example of Solo, where he hoped to go back to the classic 70s film style. Hence the production explicitly used practical rigs for the hyperspeed-travel set and an underwater explosion, relying on real hardware (a huge 180-degree LED screen and a 20,000 fps camera) to get real lighting and an “explosion never seen before.”

Then it is about learning the constraints, so people can focus on the right thing. He mentioned how the roller coaster in Disney’s Animal Kingdom was created: at the beginning it did not fit the park’s style, then people visited Nepal, found the story of the Yeti, and built up the Everest-and-Yeti backstory for the roller coaster.

Third is simplify: try to make the target simple. He mentioned a shot in World War where a rig pops out during a crash scene, which might have needed retouching to remove. However, no one actually knows what it is, and the audience is paying attention to the character’s face, so it was really not worth spending extra time removing it from the film.

The last point is about sharing. Rob mentioned the start of the ASWF, the Academy Software Foundation, where the film industry is, for the first time, trying to organize its software so tools can be shared between companies.

The ASWF actually starts with a lot of big names behind it. I think exploring these repositories could also help new people get into the business.

He also presented the photo book he made during the Solo production; I think it is a very good collection.

Nvidia’s special event was crazy and attracted a lot of people. It was also my first time seeing the CEO’s iconic gesture: holding the Nvidia card up on stage. The event was basically the announcement of the next big thing since CUDA was introduced: the Turing architecture, with which Nvidia makes real-time ray-traced rendering possible.

10 giga-rays per second, mixed GPU operation at 16 TFLOPS and 16 TIPS, 500 trillion tensor ops per second, and an 8K image decoder. This monster makes real-time ray tracing possible. It dramatically reduces the time of physically based rendering for movie-quality images, and hence could be very attractive to the movie industry. And since the basic version is not that expensive ($2,300; I think it is more worthwhile than some AR glasses), we may soon expect game developers to not have to play so many tricks with shading effects, and instead just let things follow the laws of physics.

Mr. Huang really enjoys playing to the audience with the glossy RTX card.
A demo of the real-time ray-traced Star Wars shot. The light does look real!
An introduction to how different the hardware/software stack is for the new architecture.

SIGGRAPH 2018 Day 1

Day one was so crowded! Next time, if I arrive a day earlier, I should do registration first.

So in the morning I went to the Vulkan course; it was really helpful for understanding the API, and I am glad it has all the support we expected. I think it is the way to go.

Then we visited the product exhibition; it was nice to see the props from Infinity War and Solo.

The AR session hosted by Apple basically went over what they had said at WWDC, which shows that AR is still a pretty new thing for the graphics industry. I can sense that people are looking for new things, but they are hesitant about the future.

In that case, what should we do? The Jurassic Park 25th-anniversary screening gives the answer: you just spend your spare time and do it, then disrupt the old business. From 0 to 1, that is how we make progress.

See everyone on day 2.

Hello to SIGGRAPH 2018!

Time indeed goes fast: it has been three years since my first, amazing SIGGRAPH experience. Now it is Vancouver, with a new me working on Amazon’s AR platform and trying to make it better.

So Sunday is the beginning. I plan to check in and take the Intro to Vulkan course, and maybe the deep learning one in the late afternoon (though I feel that one may be too basic).

CVPR Experiences: Conference Session 2

On Tuesday, the major show was the face session! However, we have to say that face-related research is not exactly the main thing at CVPR, as the session name indicates: “Computational Photography and Faces.” Sure, there are a lot of posters about face modeling and expression detection, but the limited oral session tells you the current trend. No worries, though; at SIGGRAPH, the face session is always packed!

The afternoon oral session had five presentations, which definitely showcase the state-of-the-art work in this area. I love this session because this is the field I belong to, and of course, our lab contributed to one of the papers!

13. Recurrent Face Aging: a cool 2D dataset containing a lot of faces covering a large age range is created and used to predict aged human faces.

14. Face2Face: Real-Time Face Capture and Reenactment of RGB Videos: what can I say — the jaw-dropping demo video ever since last SIGGRAPH Asia. This time they updated the model to work with only a 2D RGB image. The presentation was cool because it ended with a live demo with Putin as the target! The Basel morphable face model is used for identity fitting, which needs the user to show a frontal face and rotate left then right to create a subject-dependent 3D face model. The initialization procedure takes about 30 seconds; then we obtain a fully controllable avatar. The texture albedo is also learned. Based on my demo test, the system is pretty nice and smooth; however, don’t expect it to handle directional light — it assumes something more like a global light source. Even without a tongue model it can still accurately model lip movement, so normal speaking should be OK. There are more interesting stories behind this. To me, it was such a nice experience to meet the authors here at CVPR!

15. Self-Adaptive Matrix Completion for Heart Rate Estimation From Face Videos Under Realistic Conditions: a stable face region is located in the face image, and the model can be used to estimate the heart rate from image space. I am just so glad to see that the demo and illustration videos/images are actually from our database!

16. Visually Indicated Sounds: MIT always has the guts to do cool stuff. The authors noticed that human beings can predict the sound of a material pretty well from an image alone. So they spent a lot of time hitting “A LOT” of objects with a drumstick while recording with a video camera. Then they trained a deep learning model so the machine can pick up the motion of the drumstick hitting certain objects and synthesize the corresponding audio.
We know that in movies, the sound designers sometimes cannot record the real sound of a scene for various reasons and need to create the audio effect with other props. This CVPR paper is like an automatic way to do that.

17. Image Style Transfer Using Convolutional Neural Networks: want to transfer Van Gogh’s painting style to your image automatically? This is the paper for you (a minimal sketch of the core idea is below).
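For reference, here is a minimal sketch of the core trick in that line of work — my own toy code, not the authors’ implementation. Style is captured by Gram matrices of CNN feature maps, and a style loss compares those Gram matrices; the feature tensors below are random stand-ins for features that would normally come from several VGG layers.

```python
# Toy style loss via Gram matrices (illustrative tensors, not real VGG features).
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (batch, channels, height, width) feature map from one CNN layer
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    # Channel-to-channel correlations, normalized by the number of entries
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(feat_generated, feat_style):
    # Match the second-order feature statistics (the "style") of two images
    return F.mse_loss(gram_matrix(feat_generated), gram_matrix(feat_style))

# Random stand-ins for the generated image's and the style image's features
feat_generated = torch.rand(1, 64, 128, 128, requires_grad=True)
feat_style = torch.rand(1, 64, 128, 128)

loss = style_loss(feat_generated, feat_style)
loss.backward()  # gradients w.r.t. the generated image's features
```

In the full method, a content loss on a deeper layer is added, and the output image itself is optimized by gradient descent on the combined loss.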

Here are some photos again, covering the topics of the second day.

CVPR Experiences: Conference Session 1

The first day of CVPR was packed with good talks that show the current trend of computer vision research. Day one was full of object detection work, especially work using convolutional neural networks (CNNs, a.k.a. the deep learning approach).

Here I just report some interesting work:

Matching and Alignment:

    1. Learning to Assign Orientations to Feature Points: implicitly learning feature-point orientations with a CNN helps recover the missing parts of the alignment in 3D reconstruction, so you have fewer holes. It sounds like the orientation of the image patch can play a key role in image alignment.
    2. Learning Dense Correspondence via 3D-Guided Cycle Consistency: applied directly to cars, this paper talks about how to find matches between two images. The similarity needs to be at the component level; this way, you can reconstruct image B with information from image A’s pixels while still maintaining the structure and orientation of image B. It shows how to align a 3D model to a 2D image, and to handle possible occlusion they try matchability learning. A possible extension of the work is to expand the patch to the entire target so that even under occlusion we can recover a full image.
    3. The Global Patch Collider: tries to find patches that match across different images via decision-forest voting.
    4. Joint Probabilistic Matching Using m-Best Solutions: a small optimization that uses a sampling-weighted function to choose several sub-optimal solutions.
    5. Face Alignment Across Large Poses: A 3D Solution: traditionally, face alignment relies on the assumption that all tracked landmarks are visible, which is too strong, and for large poses we normally do not have that kind of labeled data. In this paper, the authors synthesize training data by aligning a morphable model to faces in arbitrary poses with known pose information, yielding the 3D positions, the corresponding 2D intensities, and the pose. A CNN can then be trained to locate the correspondences (a toy sketch of this kind of pose-based data synthesis follows right after this list).
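As a toy illustration of that data-synthesis idea (my own assumptions, not the paper’s code): if you know the 3D points and the pose, you can project them to 2D and use the projections, pose, and 3D positions as synthetic labels. The points and pose below are random stand-ins.

```python
# Toy synthetic-label generation: project known 3D points under a known pose.
import numpy as np

def weak_perspective_project(points_3d, yaw, pitch, roll, scale, t2d):
    """Project Nx3 points to Nx2 with a known rotation, scale, and 2D offset."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    R = Rz @ Rx @ Ry
    rotated = points_3d @ R.T
    return scale * rotated[:, :2] + t2d   # drop depth: weak perspective

# 68 made-up "3D landmarks" and a random large-yaw pose
landmarks_3d = np.random.randn(68, 3)
pose = dict(yaw=np.deg2rad(70), pitch=0.1, roll=0.0,
            scale=1.5, t2d=np.array([128.0, 128.0]))
landmarks_2d = weak_perspective_project(landmarks_3d, **pose)
# (landmarks_2d, pose, landmarks_3d) would form one synthetic training sample
```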

The spotlight session covered segmentation and contour detection.

  1. Affinity CNN: Learning Pixel-Centric Pairwise Relations for Figure/Ground Embedding: something I should look into.

Then I basically went to the poster session and took some photos of the posters I was interested in. One about a real-time (80 fps) CNN detector, with somewhat lower accuracy, caught my attention. Low memory bandwidth, with full code and a “how to run” tutorial — this could be a very good base for trying out cool ideas. Details can be found at Pjreddie.com/yolo.

Here are some poster photos:

At the end of the first day, the best paper award and related honors were announced. MSRA’s new deep learning model, “Deep Residual Learning for Image Recognition,” shows Microsoft’s position in this deep learning battle. Having won all the major competitions in 2015, the model may not sound very elegant, but it works. The best paper award going to this paper sets the tone of this CVPR as still being “deep learning.” Later during CVPR we learned that one of the paper’s authors, Jian Sun, had been poached from Microsoft Research Asia by Face++ with a very good salary (like 8 digits in Chinese yuan). As far as I know, a good PhD student focusing on deep learning these days normally doesn’t worry about jobs or salary at all. They are in high demand because there is so much data but so few people who have a clue how to mine it.

What you basically need to do to write an academic paper

I think it would be useful to share a little experience about how to write a paper in computer science. This may not be the only way, but I generally like to see papers written in this style.

Below is a general email I sent after revising some work. I think it also gives some basic guidance on academic paper writing.

Generally, in the abstract, the introduction, and the conclusion, you should have a short sentence summarizing how the goal is achieved; describing it as just “a method” is too abstract. The introduction also needs to clearly state the contributions of the work in a bullet list. And as always, at the end of the introduction, outline the paper structure: what Section 2 will cover, what Section 3 will cover, and so on.

I think there are three things I learned while writing papers:

1. Break up long sentences; complex sentence structure makes riddles, not publishable papers.
2. Think of writing a paper as telling a high school student how you did the work, rather than talking to ourselves. We may unintentionally omit clues and details about the work, which can make the paper hard for a stranger to understand.
3. Review is normally double-blind, which means you should not give any obvious clue about who the authors are. In your original draft, you can probably figure out where you left such traces 🙂 I have updated it.

You can update based on the revisions and comments in the current version, and I may try to update the algorithm part.