I hate Clouds - a personal perspective on why I think Clouds suck

loudwhisper@infosec.pub · 12 hours ago

Thanks for the feedback, and same to @ilmagico@lemmy.world and @jg1i@lemmy.world. I fixed the configuration of the site and now the site should be readable even in light mode.

loudwhisper@infosec.pub · 22 hours ago

I am sorry! As an amateur landscape photographer I actually like very much those clouds. There are a few r-word posts about people hating those clouds though, but I checked and they are nowhere near as long as you would expect a proper rant to be

loudwhisper@infosec.pub · edit-2 21 hours ago

In most cases! Sorry, I simply don’t believe it. Once you operate for 5, 10, 20 years not having capitalized anything is expensive as hell, even without the skill issue (which is not a great argument, as it is the case for almost anything).

It’s almost always the case with rent vs invest.

Do you have some numbers?

I cite a couple of articles in the post, and here is a nice list of companies and orgs that run outside the Cloud (it’s a bit old!) or decided to move away. Many big companies with their own DC, which is not surprising, but also smaller (Wikipedia!).

37signals also showed a huge amount of savings (it’s one of the two links in the post) moving away from the cloud. Do you have any similar data that shows the opposite (like we saved X after going cloud)? I am genuinely curious

Edit: here is another one https://tech.ahrefs.com/how-ahrefs-saved-us-400m-in-3-years-by-not-going-to-the-cloud-8939dd930af8 Looking solely at the compute resources, there was an order of magnitude of difference between cloud costs and hosting costs (x11). Basically a value comparable (in reality double) to the whole revenue of the company.

loudwhisper@infosec.pub · 1 day ago

I will have a look if there is something that suggests how to “make” a light theme. Thanks for the info!

loudwhisper@infosec.pub · 1 day ago

Redundancy should be automatic. Raid5 for instance.

Yeah it should, but something needs to implement that. I mean, when distributed systems work redundancy is automatic, but they can also fail. We are talking about redundancy implemented via software, and software has bugs, always. I am not saying that it can’t be achieved, of course it can, but it has a cost.

You can have an oracle (or postgres, or mongo) DB with multi region redundancy, encryption and backups with a click.

I know, and if you don’t understand all that complexity you can still fuckup your postgres DB in a disastrous way. That’s the whole point of this thread. Also operators can do the same for you nowadays, but again, you need to know your systems.

Much, much simpler for a sysadmin (or an architect) than setting the simplest mysql on a VM.

Of course it is. You are paying someone else for that job. Not going to argue with that. In fact, that’s what makes it boring (which I talked about in the post).

Unless you’re in the business of configuring databases, your developers should focus on writing insurance risk code, or telco optimization, or whatever brings money.

This is a modern dogma that I simply disagree with. Building an infrastructure tailored around your needs (i.e., with all you need and nothing else) and cost effective does bring money, it does by saving costs and avoiding to spend an enormous amount of resources into renting all of that, forever, scaling with your business.

You can build a redundant system in a day like Legos, much better security and higher availability (hell, higher SLAs even) than anything a team of 5 can build in a week self-manging everything.

This is the marketing pitch. The reality is that companies still have huge teams, still have tons of incidents, still take long to deliver projects, still have security breaches, but they are spending 3, 5, 10 times as much and nothing of those money is capitalized.

I guess we fundamentally disagree, I envy you for what positive experiences you must have had!

loudwhisper@infosec.pub · 1 day ago

Instant transactions are periodic, I don’t know any bank that runs them globally on one machine to compensate for time zones.

Ofc they don’t run them on one machine. I know that UK banks have only DCs in UK. Also, the daily pattern is almost identical everyday. You spec to handle the peaks, and you are good. Even if you systems are at 20% half the day everyday, you are still saving tons of money.

Batches happen at a fixed time, are idle most of the day.

Between banks, from customer to bank they are not. Also now most circuits are going toward instant payments, so the payments are settled more frequently between banks.

My experience are banks (including UK) that are modernizing, and cloud for most apps brings brutal savings if done right, or moderate savings if getting better HA/RTO.

I want to see this happening. I work for one and I see how our company is literally bleeding from cloud costs.

But that should have been a lambda function that would cost 5 bucks a day tops

One of the most expensive product, for high loads at least. Plus you need to sign things with HSMs etc., and you want a secure environment, perhaps. So I would say…it depends.

Obviously I agree with you, you need to design rationally and not just make a dummy translation of the architecture, but you are paying for someone else to do the work + the service, cloud is going to help to delegate some responsibilities, but it can’t be cheaper, especially in the long run since you are not capitalizing anything.

loudwhisper@infosec.pub · 1 day ago

Why not outsourcing just the hardware then? Dedicated servers and Kubernetes slapped on them. Hardware failure mitigated for the most part, and the full effort goes into making the cluster as resilient as possible, for 1/5 of the cost of AWS. If machines burn, it’s not your problem (you can have them spread over multiple sites, DCs, rooms, racks) anymore.

loudwhisper@infosec.pub · 1 day ago

Systems are always overspecced, obviously. Many companies in those industries are dynosaurs which run on very outdated systems (like banks) after all, and they all existed before Cloud was a thing.

I also can’t talk for other industries, but I work in fintech and banks have a very predictable load, to the point that their numbers are almost fixed (and I am talking about UK big banks, not small ones).

I imagine retail and automotive are similar, they have so much data that their average load is almost 100% precise, which allows for good capacity planning, and their audience is so wide that it’s very unlikely to have global spikes.

Industries that have variable load are those who do CPU intensive (or memory) tasks and have very variable customers: media (streaming), AI (training), etc.

I also worked in the gaming industry, and while there are huge peaks, the jobs are not so resource intensive to need anything else than a good capacity planning.

I assume however everybody has their own experiences, so I am not aiming to convince you or anything.

loudwhisper@infosec.pub · 1 day ago

I am specifically saying that redundancy doesn’t solve everything magically. Redundancy means coordination, more things that can also fail. A redundant system needs more care, more maintenance, more skills, more cost. If a company decides to use something more sophisticated without the corresponding effort, it’s making things worse. If a company with a 10 people department thinks that using Cloud it can have a resilient system like it could with 40 people building it, they are wrong, because they now have a system way more complex that they can handle, despite the fact that storage is replicated easily by clicking in the GUI.

loudwhisper@infosec.pub · 2 days ago

Proton runs fully on their own hardware, they have some positions open!

loudwhisper@infosec.pub · 2 days ago

No, it’s not true. A single system has less failure scenarios, because it doesn’t depend on external controllers or anything that makes the system distributed and that can fail causing a failure to your system (which may or may not be tolerated).

This is especially true from a security standpoint: complexity adds attack surface.

Simple example: a kubernetes cluster has more failure scenarios than a single node. With the node you have hardware failure, misconfiguration of the node, network failure. With a kubernetes cluster you have all that for each node (each with marginally less impact, potentially, because it depends for example on stateful storage, that if you mitigate you are introducing other failure scenarios as well), plus the fact that if the control plane goes in flames your node is useless, if the etcd data corrupts your node is useless, anything that happens with resources (a bug, a misuse of the API, etc.) can break your product. You have more failure scenarios because your product to run is dependent on more components to work at the same time. This is what it means that complexity brings fragility. Looking from the security side: an instance can be accessed only from SSH, if you are worried about compromise you have essentially one service to secure. Once you run on kubernetes you have the CI/CD system, the kubernetes API, the kubernetes supply-chain, etcd, and if you are in cloud you have plenty of cloud permissions that can indirectly grant you access to the control plane and to a console. Now you need to secure 5-6-7 entrypoints to a node.

Mind you, I am not advocating against the use of complex systems, sometimes they are necessary, but if the complexity is not fully managed and addressed, you have a more fragile system. Essentially complexity is a necessary evil to respond to some other necessities.

This is the reason why nobody would recommend to someone who needs to run a single static website to run it on Kubernetes, for example.

You say “a well designed system”, but designing well is harder the more complexity exists, obviously. Redundancy doesn’t always work, because redundancy needs coordination, needs processes that also depend on external components.

In any case, I agree that you can build a robust system within Cloud! The argument I am trying to make is that:

you need to be aware that you are introducing complexity that needs attention and careful design if you don’t want it to result in more fragility and exposure
you need to spend way more money
you need to balance the cost with the actual benefits you are gaining

And mind you, everything you can do in Cloud you can also do on your own, if you invest on it.

loudwhisper@infosec.pub · 2 days ago

If your compute needs expand that much everyday, and possibly shrink in others, than your use-case is one that can benefit from Cloud (I covered this in the post).

That said, if provisioning means recycle, then it’s obviously not a problem.

This is a very rare requirement. Most companies’ load is fairly stable and relatively predictable, which means that with a proper capacity planning, increasing compute resources is something that happens rarely too. So rarely that even a lead time for hardware is acceptable.

So if I may ask (and you can tell), what is the purpose of provisioning that many systems each day? Are they continuously expanding?

loudwhisper@infosec.pub · 2 days ago

With a lot of stuff on top!

loudwhisper@infosec.pub · 2 days ago

Not OP, but they are comparable efforts, especially since it’s a relatively infrequent activity. You can rent dedicated boxes with off-the-sheld hardware almost instantly, if you don’t want to deal with the hardware procurement, and often you can do that via APIs as well. And of course both options are much, much, much cheaper than the Cloud solution.

For sure speed in general is something Cloud provide. I would say it’s a very bad metric though in this context.

loudwhisper@infosec.pub · 2 days ago

Of course the problem is solved, but that doesn’t mean that the solution is easy. Also, distributed protocols still need to work on top of a complicated network and with real-life constraints in terms of performances (to list a few). A bug, misconfiguration, oversight and you have a problem.

Just to make an example, I remember a Kafka cluster with 5 replicas completely shitting its pants for 6h to rebalance data during a planned maintenance where one node was brought offline. It caused one of the longest outages to date with the websites which relied on it offline. Was it our fault? Was it a misconfiguration? A bug? It doesn’t matter, it’s a complex system which was implemented and probably something was missed.

Technology is implemented by people, complexity increased the chances of mistakes, not sure this can be argued.

Making it harder to identify SPOF means you might miss your SPOF, and that means having liabilities, and having anyway scenarios where your system can crash, in addition for paying quite a lot to build a resilience that you don’t achieve.

A single instance with 2 failure scenarios (disk failure and network failure) - to make an example - is not more fragile than a distributed system with 20 failure scenarios. Failure scenarios and SPOF can have compensating controls and be mitigated successfully. A complex system where these can’t be fully identified can’t have compensating control and residual risk might be much harder. So yes, a single disk can fail more likely than 3 disks at once, but this doesn’t give the whole picture.

loudwhisper@infosec.pub · edit-2 2 days ago

Thanks, that is a very good observation! I will try to sneak an edit later today where I can add some appendix about acronyms and abbreviations.

Edit:

While it might not look great, I have added at the bottom an Appendix with all (hopefully, I might have missed some) acronyms and abbreviations. Thanks for the suggestion!

loudwhisper@infosec.pub · 2 days ago

I wish it worked like that, but I donct think it does. Connecting clouds means introducing many complex problems. Data synchronization and avoiding split-brain scenarios, a network setup way more complex, stateful storage that needs to take into account all the quirks and peculiarities of all services across all clouds, service accounts and permissions that need to be granted and segregated for all of them, and way more. You may gain resilience in some areas, but you introduce a lot more things that can fail, be misconfigured or compromised.

Plus, a complex setup makes it harder by definition to identify SPOFs, especially considering it’s very likely nobody in the workforce is going to be an expert in all the clouds in use.

To keep using your simile of the disks, a single disk with a backup might be a better solution for many people, considering you otherwise might need a RAID controller that can fail and all the knowledge to handle and manage a RAID array properly, in addition to paying 4 or 5 times the storage. Obviously this is just to make a point, I don’t actually think that RAID 5 vs JBOD introduces comparable complexity compared to what multi-cloud architecture does to single-cloud.

loudwhisper@infosec.pub · 2 days ago

Complexity brings fragility. It’s not about doing the job right, is that “right” means having to deal with a level of complexity, a so high number of moving parts and configuration options, that the bar is set very high.

Also, I would argue that a large number of organizations don’t actually need the resilience that they pay a very high price for.

loudwhisper@infosec.pub · 2 days ago

Yeah in general you can’t mess the building blocks from the PoV of availability or internal design. That is true, since you are outsourcing it. You can still mess them up from other points of view (think about how many companies got breached due to misconfigured S3 buckets).

loudwhisper@infosec.pub · 2 days ago

Thanks! I went and tried on my phone and indeed setting Firefox to light mode indeed causes that horrendous and unreadable result. I will need to figure out way, eventually, and provide an alternative light scheme.

loudwhisper@infosec.pub · 3 days ago

I hate Clouds - a personal perspective on why I think Clouds suck