I can only measure as well as my system clock allows. Ask anyone in TRUE HFT, and they will tell you timing is a challenging science. The way I'm measuring things is far from scientific...so take this for what it's worth:
Using std::chrono::system_clock, my C++ to Python and back to C++ round trip takes less than 1 microsecond, which is the resolution of the clock I'm using. Obviously, this will increase if something expensive is done on the Python side.
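To give an idea of how a measurement like this can be taken, here is a rough sketch, not my actual code. Since a single call is below the clock's resolution, it times many calls and averages. Two assumptions to flag: fake_python_round_trip is a hypothetical stand-in for the real call into Python, and mean_ns_per_call is a helper name I made up for illustration. It also defaults to std::chrono::steady_clock rather than system_clock, because steady_clock is monotonic and better suited to measuring intervals; pass system_clock as the first template argument to match what I described above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the C++ -> Python -> C++ round trip.
// In a real setup this would call into the embedded interpreter.
inline int fake_python_round_trip(int x) {
    return x + 1;
}

// Call `fn` `iterations` times and return the mean cost per call in
// nanoseconds. Averaging over many calls works around a clock whose
// resolution is coarser than a single call.
template <typename Clock = std::chrono::steady_clock, typename Fn>
double mean_ns_per_call(Fn&& fn, std::int64_t iterations) {
    assert(iterations > 0);
    const auto start = Clock::now();
    for (std::int64_t i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = Clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

To use it, wrap the call in a lambda that writes its result to a volatile variable (so the compiler cannot optimize the loop away) and pass it in, for example mean_ns_per_call([&] { sink = fake_python_round_trip(sink); }, 1000000). The number you get still includes loop overhead, so treat it as a rough upper bound.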
My C++ and Python code are running in the same process space, which is why I'm running so much faster than you. In-process, in-memory communication will ALWAYS be significantly faster than any TCP-based scheme (like HTTP).
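I'm no TCP expert, but HTTP is a protocol layered on top of TCP, and I suspect its per-request overhead (headers, parsing, possibly a new connection each time) makes it relatively slow compared to a plain TCP socket. That may be a place to start.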
Also...if you're doing something where a few ms of latency makes or breaks your success, you need to be really careful with how you're using Python.