I agree with Jeremy. The Pi should be left out of any real-time stuff. The one exception to that is playing sounds. I have had good luck with my Pi playing all the sounds. The Van Halen machine has the equivalent of a slightly underpowered Pi Zero. (It is using a Raspberry Pi Model B, v1: same memory as a Pi Zero but with a much slower processor.) The analog output from my Pi is atrocious, so I bought a USB-to-audio dongle (costs about $3). The driver for those sets up a DMA channel to move the music to the dongle, so it only slightly affects the performance of the processor. Using pygame as the basis, I can play about 5 sounds concurrently and have a background track going continuously.
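For anyone who wants to try the same thing, here is a rough sketch of that kind of pygame audio setup. It is not the code from the machine: the function names, the buffer size, and the idea of preloading every effect at startup are assumptions for illustration. Only the pygame mixer calls themselves are real.

```python
# Assumed values for illustration; tune them for your own hardware.
NUM_EFFECT_CHANNELS = 5   # matches the "about 5 sounds" figure above
MIXER_SETTINGS = {"frequency": 44100, "size": -16, "channels": 2, "buffer": 1024}


def start_audio(music_path, effect_paths):
    """Start a looping background track and preload the sound effects.

    Returns a dict mapping each effect path to a loaded pygame Sound.
    """
    # Imported here so this file still loads on a machine without pygame.
    import pygame

    pygame.mixer.init(**MIXER_SETTINGS)
    pygame.mixer.set_num_channels(NUM_EFFECT_CHANNELS)

    # mixer.music is a separate stream, so it does not use an effect channel.
    pygame.mixer.music.load(music_path)
    pygame.mixer.music.play(loops=-1)   # -1 repeats indefinitely

    # Decode the effects once at startup so playing them later is cheap.
    return {path: pygame.mixer.Sound(path) for path in effect_paths}


def play_effect(sounds, path):
    """Play a preloaded effect on whichever mixer channel pygame picks."""
    return sounds[path].play()
```

Call start_audio() once at power-up with your own file names, then call play_effect() from the rules code whenever a target is hit.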
In the OPP pinball framework, I believe that I use 3 threads. One thread deals with the serial communications. One thread updates the backglass by blitting as small an area as possible. (If only the second player's score is changing, don't redraw the whole backglass; simply blit the rectangle that contains the score.) The last thread runs the rules for the pinball machine (timers, looking at the last read values of inputs, posting flags to send new serial commands, etc.). Python suffers from a global interpreter lock, but that separation of processing has served me well.
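To make that split concrete, here is a minimal sketch of the three-thread layout using only the standard library. This is not the OPP framework's actual code: run_demo, the "kick" command, SCORE_RECT_P2, and the callback parameters are all made-up names, and the real hardware I/O is replaced by functions the caller passes in.

```python
import queue
import threading
import time

SCORE_RECT_P2 = (400, 20, 160, 40)   # hypothetical x, y, w, h of player 2's score


def run_demo(read_switches, send_command, blit_rect, run_seconds=0.2):
    """Run the three loops briefly against caller-supplied I/O functions."""
    stop = threading.Event()
    lock = threading.Lock()
    state = {"switches": 0}        # last input values read over serial
    commands = queue.Queue()       # rules -> comms: serial commands to send
    dirty = queue.Queue()          # rules -> display: rectangles to redraw

    def comms_loop():
        while not stop.is_set():
            value = read_switches()
            with lock:
                state["switches"] = value
            while not commands.empty():      # this loop is the only consumer
                send_command(commands.get())
            time.sleep(0.001)

    def display_loop():
        while not stop.is_set():
            try:
                blit_rect(dirty.get(timeout=0.01))   # redraw only what changed
            except queue.Empty:
                pass

    def rules_loop():
        previous = 0
        while not stop.is_set():
            with lock:
                current = state["switches"]
            if (current & 1) and not (previous & 1):   # switch 0 just closed
                commands.put("kick")                   # hypothetical command
                dirty.put(SCORE_RECT_P2)               # score changed
            previous = current
            time.sleep(0.001)

    threads = [threading.Thread(target=loop)
               for loop in (comms_loop, display_loop, rules_loop)]
    for thread in threads:
        thread.start()
    time.sleep(run_seconds)
    stop.set()
    for thread in threads:
        thread.join()
```

The short sleeps and the blocking queue read are what let the threads take turns despite the global interpreter lock: each loop gives up the processor whenever it has nothing to do.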
A Pi 2 and certainly a Pi 3 should have enough horsepower to do what you are trying to do. I assume that none of your communication is bit-banged. If it is, it can take a lot of processor time.
On the lamp matrices, some machines have them and some do not. Dolly did not have a matrix; instead, each lamp was individually brought to a low-side switch to turn it on/off. In the old Bally hardware, an I/O expander (similar to your bit shifter) was used to turn the lamps on/off. The I/O expander had a parallel interface to the processor with a chip select to update the values, so it could happen very quickly. With the Arduino, the processor probably uses a hardware parallel-to-serial converter, and then the shift register just converts it back to parallel.
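The parallel-to-serial-and-back idea can be sketched in a few lines. This only illustrates the data packing, not any particular board's protocol: pack_lamps and unpack_lamps are made-up names, and the bit order (lamp 0 in bit 0 of the first byte) is an assumption that in practice depends on how the shift register outputs are wired.

```python
def pack_lamps(lamp_states):
    """Pack a list of on/off lamp states into bytes, lamp 0 in bit 0."""
    out = bytearray((len(lamp_states) + 7) // 8)
    for index, lit in enumerate(lamp_states):
        if lit:
            out[index // 8] |= 1 << (index % 8)
    return bytes(out)


def unpack_lamps(data, count):
    """Recover the first `count` lamp states, as the shift register's
    parallel outputs would present them after the bytes are clocked in."""
    return [bool(data[index // 8] & (1 << (index % 8))) for index in range(count)]
```

With this layout, 16 individually driven lamps become two bytes on the wire, which is why a hardware serial port handles it so much more cheaply than toggling pins in software.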