Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> runs SQL in desired order

Sure, one at a time. dbt builds a DAG of the SQL operations and can run transformations concurrently.



and why would you need this? If the goal is to reduce e2e latency, then optimizing SQL is likely much more beneficial, since it can improve performance by NNN times, not just by N times.


Well there's 2 main reasons.

The first is that you can update a subset of your pipeline, like say you have an internal dashboard for server stats and another external dashboard that's a product for customers, and you notice a problem w/ the server stats dashboard. You can have dbt just build the models that underlie the server stats dashboard and also exempt any model used by the external dashboard.

The second is that you can effectively defer processing. dbt builds views by default (though it of course will also build tables, including ephemeral tables that are cleaned up after a build), which means your transforms don't take place until you query the data. If you can get away with it, this is what you want, and there are lots of ways to stay on views (caching, use BigQuery, etc). This means your pipeline at least has the potential of being "real time".

But, overall we're talking about pipelines that have 100s or 1000s of models. You need some kind of system for that; a shell script isn't gonna cut it.


You obviously can have modularity with shell scripts:

run_all_etl.sh

build_internal_dashboard_etl.sh (called from run_all_etl.sh)

build_external_dashboard_etl.sh (called from run_all_etl.sh)


No you misunderstand. If you have some kind of dependency tree that's like:

    A        B        C        D
        1        2        3
        i        ii
    Ai     i-ii
...and you want to only rebuild Ai and anything that it might rely on, all you do is `dbt build +Ai`. If you want to rebuild B and update anything that relies on it, all you do is `dbt build B+`

> You obviously can have modularity with shell scripts

Just as kind of a meta point, I think you're assuming that people are introducing a lot of incidental complexity into this problem, but I don't think you're open to the possibility that there's a lot of essential complexity here. I'm talking about DAG tools and the features that hang off of them, and you're saying "hey there's new thing called shell scripting". That's not useful in this conversation. Maybe going forward, let's assume people know about Bash.


In my example ETL is divided into modules with clear scope and dependencies, as I believe should happened in software practices.

If you have kitchen sink of 1000 tables chaotically connected to each other, then yes, you need DAG management tool to manage chaos..


> If you have kitchen sink of 1000 tables chaotically connected to each other, then yes, you need DAG management tool to manage chaos.

This describes 99% of data teams, so I think we agree!


> This describes 99% of data teams, so I think we agree!

there is a chance that dbt incentivies data teams to go that route, otherwise they would think about modularity and ownership more.


I'm a Vim user and I've made this point when I rail against IDEs; it's even on the front page today [0]: "5. The belief that complex systems require armies of designers and programmers is wrong. A system that is not understood in its entirety, or at least to significant degree of detail by a single individual, should probably not be built."

I think I agree, but I think there are a lot of things you could say about this. What if you're just starting out and you don't know how to efficiently design systems? What if you work in an industry of people under 40 (under 30?) who by definition can't have the kind of experience that would let you design small systems? What if you work for a company that doesn't want to give you the time to figure it out? What if you have a competitor that's just shipping trash and forcing you out of the market with a "we'll clean up the tech debt later" mentality? How do we make tools that are powerful enough to enable us to do hard things, but wise enough to keep us from doing foolish things? Should we rely on our tools to do this? Is this an argument against centralization? Isn't that deeply inefficient?

Honestly I struggle with this all the time. I've got the outlines of some ideas and there are some tech communities grappling with the same issues, but I think we're just at the very beginning of a rethink about how we do software.

[0]: https://liam-on-linux.dreamwidth.org/88032.html


> How do we make tools that are powerful enough to enable us to do hard things, but wise enough to keep us from doing foolish things? Should we rely on our tools to do this?

IMO: No we shouldn't. And that's why I'm in the "Think about it yourself, bulid your ETL with a clean structure so you know what your 1,000 [SQL|shell] scripts do" in stead of the "Let your tool figure it out for you" one.

> Is this an argument against centralization? Isn't that deeply inefficient?

Or for centralization, if we hark back to the "You can get a loooong way on a single server" argument above. Depends on how you define "centralization", I suppose. (Seems to me everything always depends on your definitions, in the end.)

But wait, does this still have anything to do with IDEs? As opposed to Vim (or, say, Notepad)? If so, I'd say IDEs also belong to the category good centralization. At least older ones, before they became desktoppified Web apps with baked-in AS (Artificial Stupidity)...

> Honestly I struggle with this all the time. I've got the outlines of some ideas and there are some tech communities grappling with the same issues, but I think we're just at the very beginning of a rethink about how we do software.

On that debate I'm not so much conservative as... Reactionary.


How many models are in "run all etl", and what if one fails? You can't see how issues can arise here, with what you describe? I mean, go ahead and write your SQL pipeline in bash. Many of us are using dbt because we see a lot of benefit, we aren't all morons.


> How many models are in "run all etl", and what if one fails?

hundred tables. one fail -> whole pipeline fails to ensure e2e and overall consistency.


> and why would you need this?

So things finish considerably faster?

> then optimizing SQL is likely much more beneficial

This only gets you so far. Running transformations one-at-a-time when the server can do more is just holding you back. My last dbt project had over 300+ tables and running transformations serially would have taken an order of magnitude longer than running concurrently.


> Running transformations one-at-a-time when the server can do more is just holding you back

in general, the end goal of such ETL optimization is to have all your CPU or IO saturated. But I imagine for 300 tables transformations it could be too much work.


The goal of ETL is to transform data for practical applications, for which CPU and memory is there to be utilized.

This particular dbt process was running against the second smallest Redshift instance, which barely noticed when the process kicked off every 10 minutes to run incremental updates. Would run for maybe 60-90 seconds, with only the occasional lock contention with long-running analyst queries that was easily mitigated.


> The goal of ETL is

I said goal of "such ETL optimization", not ETL.

Luckily for me it is an hour and 50 lines(likely less) of code to write such parallel executor with dependency tracking if need will arise.


I had the misfortune of having to reimplement dbt at work (we have a weirdo "don't directly connect to app DBs" rule so I had to add support for running arbitrary Python to fetch data from our microservices over gRPC), and yeah the dependency tracking, subsetting, testing, parallelism and such is super easy -- it's more like 1000 lines of Python/SQL for the basics, but yeah no sweat really.

Where we ran into issues was:

- Availability: the general model of dbt is DROP the model (OK by default they TRUNCATE but w/e) and rebuild it. But this means that model's unavailable (read: locked) during rebuilding. Further, you can't do that if any downstream view depends on your model. Even if you decide it's OK to drop the downstream view, rebuild all its upstream dependencies, then rebuild it, it's unavailable the whole time. Further, you probably don't have just one downstream view depending on these upstream models, so you've got to work out all the downstream views to DROP and then rebuild, which is like a secondary dependency tree to walk while walking the primary dependency tree. But, that mind-bending weirdness aside, you're essentially making your whole DB unavailable during a build. Sometimes this is OK, for us it really wasn't.

- Incremental tables added a lot of complexity: rebuilding bigger tables that mostly don't change is a big waste of time and resources. But this is an entirely parallel code path which more or less doubled our code burden.

We tried pretty hard to manage these, which ultimately was a big rabbit hole. We eventually settled on schema versioning where we'd build a whole new DB in a different schema, run tests on it, and promote it if it passed. This let us avoid both the availability and incremental ditches, and as a bonus we could instantly restore past schemas if we discovered issues (restoring DB backups is sloooooooow and lossy).

But, I think the engineering lesson here was we either should have rescinded that weirdo app DB rule, or restricted our custom work to ELTing the data out of our apps and used dbt downstream of that.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: