If you use a lock, then only one thread can be inside the locked region at a time.
If you use a sufficiently advanced STM and have an atomic region, then two threads can enter the same region at once, as long as they don't touch each other's memory. You may think: why synchronise at all if they're not using each other's memory? Well, it can be extremely hard to work out a priori whether two threads will touch the same memory. In fact, for some problems, such as many graph problems, working out what memory they're going to use is the same problem as the one you're trying to solve in the first place.
STM is like having a lock for every single word of memory, and then magically acquiring every lock you will need at the start of the transaction, in a deterministic order, so that you never deadlock.
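That deterministic-order trick is the same one used to avoid deadlock with plain locks. Here's a toy Python sketch (a hypothetical `Account` class with per-object locks rather than per-word ones, purely for illustration): because every thread acquires locks in one global order, no two threads can each hold a lock the other is waiting on.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(a, b, amount):
    # Sort by a stable key so that concurrent transfer(a, b) and
    # transfer(b, a) acquire the same two locks in the same order,
    # which makes deadlock impossible.
    first, second = sorted((a, b), key=id)
    with first.lock, second.lock:
        a.balance -= amount
        b.balance += amount
```

An STM does this for you automatically, for every word a transaction touches, without you having to know the lock set up front.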
If two threads enter an atomic region and do access each other's memory, then one transaction is silently aborted and restarted. Ideally these aborts are rare and the program runs largely in parallel.
A more concrete example: in Java, if you have a hash table, you will usually lock it during all reads and writes. With STM, if two writes access different buckets of the table, there's no need for either to block; the STM works this out for you.
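To make the abort-and-restart idea concrete, here's a toy versioned STM in Python — a minimal sketch under my own assumptions, not how PyPy or any production STM is actually implemented. Each `TVar` slot carries a version number; a transaction records the versions it read, buffers its writes, and at commit validates that nothing it read has changed, silently retrying otherwise. Two transactions touching disjoint slots (different "buckets") never invalidate each other.

```python
import threading

_commit_lock = threading.Lock()  # serialises only the short commit step

class TVar:
    """One transactional memory slot with a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version observed when first read
        self.writes = {}  # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.writes:          # read-your-own-writes
            return self.writes[tvar]
        self.reads.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value        # buffered until commit

def atomically(fn):
    while True:                          # abort -> silently restart
        tx = Transaction()
        result = fn(tx)
        with _commit_lock:
            # Validate: nothing we read has been committed over.
            if all(tvar.version == v for tvar, v in tx.reads.items()):
                for tvar, value in tx.writes.items():
                    tvar.value = value
                    tvar.version += 1
                return result
        # Validation failed: another thread touched our read set; retry.
```

Two transactions incrementing different `TVar`s commit without ever retrying, which is the hash-bucket situation above.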
Everything you said is correct, but it may help others to know that this is called optimistic synchronization, which is a nice, Googleable term. Pessimistic synchronization uses locks: if you need to synchronize, you assume that anyone else will access the data in a way that conflicts with you, so you lock it, gain exclusive access, and make everyone else wait.
Optimistic synchronization makes the opposite assumption, which allows for higher concurrency in some situations. The cost is all of the overhead of conflict resolution if it turns out two threads do want to access and change the same pieces of memory.
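A side-by-side sketch of the two assumptions on a shared counter (hypothetical classes, purely illustrative): the pessimistic version holds a lock for the whole operation, while the optimistic one does its work speculatively and only locks to validate and commit, retrying on conflict.

```python
import threading

class PessimisticCounter:
    """Assume conflict: take the lock for the whole operation."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # everyone else waits, even threads
            self.value += 1     # that would not have conflicted

class OptimisticCounter:
    """Assume no conflict: work without the lock, then validate-and-commit."""
    def __init__(self):
        self.value = 0
        self._version = 0
        self._commit = threading.Lock()  # guards only the short commit

    def increment(self):
        while True:
            seen = self._version
            new_value = self.value + 1          # speculative work, unlocked
            with self._commit:
                if self._version == seen:       # nobody interfered
                    self.value = new_value
                    self._version += 1
                    return
            # Conflict: another thread committed first; redo the work.
```

The retry loop is exactly the conflict-resolution overhead mentioned above: cheap when conflicts are rare, wasted work when they are common.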
Ah, thanks for the clarification. From the post I gathered that the atomic region locked the entire data structure and that any other thread would wait; I didn't realize the conflict detection was per data region.
That sounds pretty nice, then. I'm looking forward to seeing the sorts of speedups possible with this.
Well, take a look at any of the copious literature on the subject:
This paper shows some example speedups, though it generally doesn't give a baseline to compare against. Look for papers that cite it to see more examples.
C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun. STAMP: Stanford transactional applications for multi-processing. In IEEE International Symposium on Workload Characterization, 2008.
The fact that threads are one of the most widely supported and frequently chosen models for concurrent programming should be a clue that they are at least not a bad choice. I applaud the PyPy project for trying something different, and for doing it in a way that doesn't paint them into a corner with a less-than-ideal solution.