So just the ETL portion is that slow? That's really odd. I think I see faster performance than that ETLing on Hadoop. Unless there's something complex going on here.
Did you see the steps being performed in the workflow? How would you perform these steps on Hadoop, and what kind of times would you expect for each step?
I went back to look and only just realized you had a “TLDR here’s the full details” bit. I missed that and saw only the 5x faster ETL, which I figured was what you were crowing about. I’ll have to look at the rest later. I still don’t get why the ETL phase has to be so slow, but maybe it’ll be obvious when I look, like you’re doing extensive transformations or something. But no one would call that “just ETL”. Anyway, I’ll see for sure later.
Cacaw! I am glad the crowing reached you :). You probably did not get far if you are still on the old numbers that were in the first paragraph.
ETL can be very extensive. For example, we first built this when we had to take data from 15 different database systems that represented individuals and their pension contributions and join across these systems. It was a largish join: about 4 tables from each system, so around a 60-table join.
That was "JUST ETL". The job was preparing the data for training. ETL is oftentimes a large part of people's workflow. Looking for needle-in-a-haystack situations can be JUST ETL too. If you believe there is a more apt word for extracting data from a system, performing unspeakable transformations on it, and then making that information available to another process, then please tell me. Being of Peruvian stock myself, I take great license with my language and grammar.