I don't really care what others say, but to prove that this has any performance value you should do the following:
Compare your most "parallel" algorithm against a corresponding well-written MPI application running over Open MPI's shared memory transport. If your system comes out measurably ahead, then it has some value.
Of course Open MPI's shared memory transport is terribly buggy, but it should give an acceptable performance baseline.
Without such a comparison, we have no idea whether the system has any performance value.
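
For what it's worth, here is a rough sketch of the kind of timing harness I have in mind. It is only an illustration: the function names (`median_wall_time`, `compare`) are made up for this post, it measures whole-process wall-clock time rather than anything finer, and it assumes both programs solve the same problem on the same inputs.

```python
import statistics
import subprocess
import time


def median_wall_time(cmd, repeats=5):
    """Run `cmd` (an argv list) `repeats` times; return the median wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True makes a failing run raise instead of silently counting as a sample.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(candidate_cmd, baseline_cmd, repeats=5):
    """Time both commands and return (candidate_s, baseline_s, speedup).

    speedup = baseline_s / candidate_s, so a value above 1.0 means the
    candidate finished faster than the MPI baseline.
    """
    candidate_s = median_wall_time(candidate_cmd, repeats)
    baseline_s = median_wall_time(baseline_cmd, repeats)
    return candidate_s, baseline_s, baseline_s / candidate_s
```

You would call it with something like `compare(["./my_parallel_app"], ["mpirun", "-np", "4", "--mca", "btl", "self,sm", "./my_mpi_app"])`. Both binary names are placeholders, and the name of the shared memory component passed to `--mca btl` depends on your Open MPI version, so check `ompi_info` before trusting that command line.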