Data is everywhere. We generate data every second. Even now, while I am typing this, I generate a byte of data with every symbol typed. In the background I am telling Spotify that I'm listening to music. My smart bulbs generate light, plus data that the light is turned on. And if I wore a wearable of some kind, I would be generating data about my steps, fitness routines and so on. And there are millions of dudes like me.
It is said that more than 100 million spam emails are sent every minute. Netflix users stream more than 70 thousand hours of video every minute. YouTube users watch more than 4 million videos every minute. It is predicted that the Global Datasphere – which is all the data that's collected, everywhere – will grow from 33 zettabytes last year to 175 zettabytes by 2025. (Yeah, a zettabyte is a 1 followed by 21 zeros, and yeah, it's a huuuuge number).
Data can be structured (stored in relational databases), semi-structured (usually stored in NoSQL databases as key-value objects, or in XML or JSON files) or unstructured (which is basically everything else: files, emails, music, video, images, social media content and so on).
The tricky thing is that structured data represents only 10% of the whole Datasphere, semi-structured has its own 10% share, and the biggest part belongs to unstructured data – which is the hardest to process and analyze.
Modern data management platforms must capture data from diverse sources at speed and scale. Data needs to be pulled together in manageable, central repositories—breaking down traditional silos. The benefits of collection and analysis of all business data must outweigh the costs.
Businesses have often struggled to build systems that can efficiently collect, store and process data to extract valuable insights. It seems like they collect tons of data, and it just gets frozen and lost on some on-premise IBM mainframe or on the "good old FTP" XD.
So how do cloud technologies change that?
S3: Why The Three S’s?
Because it is Amazon Simple Storage Service – Amazon S3. Yes, I will be talking about Amazon Web Services (AWS), but I am pretty sure that other cloud providers (GCP, Azure) provide similar services.
I actually answered the question in the title of this section, but let's be honest: for a tweet that's cool, for a paragraph of an article – not really. So what is S3, and how does it help solve the problem of zettabytes of data?
S3 is storage for the internet. And for developers. I mean, if you need your personal 1 TB in the cloud, use Google Drive, Microsoft OneDrive, Dropbox or any other cloud solution designed for that. S3 will just cost you more.
S3 is object storage. An object is not just a file: it is a file plus its metadata. And the file can be of any kind – you literally can store anything there. S3 is secure, highly scalable, natively online with HTTP access, and it offers 99.999999999% (eleven nines of) durability. Yes, there is no 100% confidence. Everything can be blown away, but so many precautions are taken that data loss is highly improbable. It's as if S3 tells you: "Mate, if the moon is in its 27th phase, and at the same time an earthquake starts in Indonesia and a tsunami drowns Switzerland, and at the same exact time snow covers all of Brazil, and there is a solar eclipse, and at the same time John who lives in New Zealand loses his 3 sheep: 1 black, 1 white and 1 red – then there is a possibility you might lose your data too."
What? Weird things happen all the time.
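To put eleven nines in perspective, here is the arithmetic behind that durability figure as a small sketch (the 10-million-object scenario is AWS's own illustration of the number):

```python
# Eleven nines of durability: the probability of losing any given
# object in a year is 1 - 0.99999999999 = 1e-11.
ANNUAL_LOSS_PROBABILITY = 1 - 0.99999999999

def expected_annual_losses(num_objects: int) -> float:
    """Expected number of objects lost per year."""
    return num_objects * ANNUAL_LOSS_PROBABILITY

# Store 10 million objects and you can expect to lose a single
# object roughly once every 10,000 years.
losses_per_year = expected_annual_losses(10_000_000)
years_per_loss = 1 / losses_per_year
print(f"{losses_per_year:.6f} objects lost/year, one loss every {years_per_loss:,.0f} years")
```

So yes: statistically, your cat photos will outlive your cat, you, and probably your civilization.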
Basic S3 concepts
Where was I? Ah, objects in S3. And access to them. And other cool features. First of all, you have to understand that S3 stores objects in buckets. Yes. Buckets.
Those buckets are pretty cool though. A bucket is a kind of folder, but more sophisticated: buckets are logical containers for objects. You can have one or many buckets, and you can set up access permissions on each of them. You choose who can create, delete or list objects in each and every bucket. You can also view the access logs for your buckets and select the geographical region where S3 will store each bucket and its contents.
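One practical thing about buckets: their names live in a global namespace, so S3 is picky about them. Here is a simplified sketch of the main naming rules (the official rules have more edge cases, like forbidding IP-address-like names):

```python
import re

# Simplified S3 bucket-naming check: 3-63 characters, lowercase
# letters, digits, dots and hyphens, starting and ending with a
# letter or digit, and no consecutive dots.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    return bool(BUCKET_NAME_RE.match(name)) and ".." not in name

print(is_valid_bucket_name("my-data-lake"))   # True
print(is_valid_bucket_name("MyBucket"))       # False: no uppercase allowed
print(is_valid_bucket_name("ab"))             # False: too short
```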
Accessing your dirty objects
I hope you don’t store any spicy videos there! Even if you do, you still have to know how to access them.
Once an object is stored in S3, it receives its own unique object key. Here is an example of a URL for a single object in a bucket named doc, with an object key composed of the prefix 2006-03-01 and the file name AmazonS3.html: https://doc.s3.amazonaws.com/2006-03-01/AmazonS3.html
The combination of bucket, key and version ID is the unique identifier of each and every object in the storage. By using this combination plus the web service endpoint, every object can be uniquely addressed (the version ID is optional).
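That mapping from identifier to address can be sketched in a few lines, using the example above (bucket "doc", key "2006-03-01/AmazonS3.html") and the virtual-hosted addressing style:

```python
# Sketch: how bucket + key (+ optional version ID) become a
# virtual-hosted-style S3 URL.
def object_url(bucket: str, key: str, version_id=None) -> str:
    url = f"https://{bucket}.s3.amazonaws.com/{key}"
    if version_id is not None:
        url += f"?versionId={version_id}"
    return url

print(object_url("doc", "2006-03-01/AmazonS3.html"))
# https://doc.s3.amazonaws.com/2006-03-01/AmazonS3.html
```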
Pricing. Yes, you will have to pay for it.
Actually, AWS charges you for pretty much any action you take inside S3, but no worries: the price is quite reasonable. Below you will see a basic estimate for 1 TB of storage on S3 and how much you will have to pay for it. You will also see that for personal usage (I mean storing cute photos and videos of your dog or cat) it is not that good a deal, simply because of the number of parameters you have to take into account.
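A back-of-the-envelope version of such an estimate looks like this. The per-unit prices here are assumptions, roughly S3 Standard in us-east-1 at the time of writing; always check the current AWS price list:

```python
# Rough S3 Standard monthly estimate. All prices are assumptions
# (approximate us-east-1 rates); check the AWS price list.
PRICE_PER_GB_MONTH = 0.023   # storage, per GB-month
PRICE_PER_1K_PUT = 0.005     # PUT/COPY/POST/LIST requests, per 1,000
PRICE_PER_1K_GET = 0.0004    # GET requests, per 1,000

def monthly_cost(storage_gb, put_requests, get_requests):
    return (storage_gb * PRICE_PER_GB_MONTH
            + put_requests / 1000 * PRICE_PER_1K_PUT
            + get_requests / 1000 * PRICE_PER_1K_GET)

# 1 TB stored, 10k uploads and 100k downloads in a month:
print(f"${monthly_cost(1024, 10_000, 100_000):.2f}")
```

Note that storage dominates here; data transfer out (not modeled above) is often the real surprise on the bill.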
Data analysis solutions with S3
Well, S3 is actually a very basic service: you store data there and that's it. Period. But when you combine it with other cloud services, it opens a lot of doors in the world of development:
- Decoupled compute and storage – now you can store all your data in one place and just create small microservices (or use EC2 virtual machines) that make use of that data.
- Centralized data architecture – all your data is in one place, you can just connect multiple tools to make use of it, and it won't be necessary to distribute thousands of copies of the same Excel files over and over again.
- Integration with clusterless and serverless AWS services – yes, you can just write a serverless function/app/script using AWS Lambda and it will work perfectly with S3. Or you can use Amazon Athena to query data in S3 using SQL, without ingesting that data into a relational database.
- Standardized APIs – S3's RESTful APIs are simple, easy to use, and supported by most major third-party independent software vendors (ISVs), including Apache Hadoop and other leading analytics tool vendors. Even a kid could manage it XD.
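That "standardized API" point really just means objects are plain HTTP resources. As a minimal sketch, here is a GET request for the documentation-example object (assuming it is public); the request is only constructed, not sent:

```python
import urllib.request

# An S3 object is addressable with an ordinary HTTP GET.
url = "https://doc.s3.amazonaws.com/2006-03-01/AmazonS3.html"
request = urllib.request.Request(url, method="GET")

print(request.full_url)
print(request.get_method())   # GET

# Actually fetching it would be: urllib.request.urlopen(request).read()
# In practice you would use an SDK such as boto3:
#   boto3.client("s3").get_object(Bucket="doc", Key="2006-03-01/AmazonS3.html")
```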
More and more businesses are moving to the cloud because… well, because it makes sense: it makes the life of developers easier and it is just more efficient. And it's cool! The future is in the cloud. So now, when someone asks why you are still somewhere in the clouds, you can respond with absolute confidence – "I am learning" 🙂