Use CUDA cores and Tensor Cores at the same time

Here are some related threads: 1 2 3 4