S3 Backup design doc #2627
Conversation
doc/design/s3-backup.md (outdated)
> The DANDI Archive is expecting a ramp-up in data volume of 6 PB of new data over each of the next five years, culminating in a total of 30 PB.
>
> Scaling up the previous analysis means that the monthly costs are projected to rise to a total of **~$31,000/month** once all of that data is in place. While $1,000/month may be a feasible ongoing cost, $30,000/month is not.
With numbers on this scale it would be useful to compare the total cost to the equivalent effort of just enlarging the MIT Engaging partition.
Also worth pointing out long-term sustainability costs: all AWS costs accrue with time (a cost per month, indefinitely), whereas the MIT backup, running on hardware owned outright, is less time-dependent.
The numbers are not as bad now that we've corrected the data volume assumption. It may still be a good idea to compare to MIT Engaging, though we will want to take into account non-financial costs if we do that (engineering time, differences in reliability, and the ongoing costs in those terms of maintaining that backup long-term).
How are we currently storing data on that partition? DANDI uses S3 (and, in theory, MinIO for local development or non-standard cloud deployment); what kind of storage system do the Engaging backups use?
The output of `df -hT .`:

    Filesystem                    Type
    hstor004-n1:/group/dandi/001  nfs4
@kabilar Do you recall the quote from MIT for expanding that storage? Were there long-term costs or just a one-time deal?
Quotes obtained for MIT; this thread can be closed (I don't have permission to close it).
We are no longer considering the use of a bucket in a different region.
We are expecting roughly 6 PB of new data in total over the next 5 years, not 30 PB.
doc/design/s3-backup.md (outdated)
> Scaling up the previous analysis means that the monthly costs are projected to rise to a total of **~$6,100/month** once all of that data is in place.
> The worst-case disaster recovery cost would similarly scale up to a total of **~$16,000**.
Would appreciate seeing a table of cost estimates per year assuming a 1PB increase per year (plus perhaps an extra 500 TB worst-case jump in the next year due to Kabi's latest LINC estimate), with a grand total after 5 years in the last column
Table is in other document (comment can be closed)
When the AWS docs say "GB", they mean 10^9 bytes, not 2^30 bytes.
Clarify purpose of calculating the expected bucket storage cost covered by AWS already.
> while the associated backup costs would represent only an additional $`\$5900 / \$126000 \approxeq 4.7\%`$ of the cost of the storage itself.
> To help provide a significant level of safety to an important dataset, AWS may be willing to cover such a low marginal cost.
Suggested change:

> To help provide a significant level of safety to an important database, it may be worth reaching out to see if AWS may be willing to cover such a low marginal cost.

The original wording sounds as if we are speaking for AWS.
Although, given their previous seeming lack of concern about applying Glacier to the main archive contents (to 'save ephemeral costs'), I am guessing their perspective is less about the monetary aspect (which is being waived either way) than about the actual additional storage at the data center (essentially doubling the size of the archive, even as it grows).
Please remove this line as it has been confirmed that the Open Data program will not cover backup.
@satra Two things relevant to this discussion:
> ## Cost
>
> **Below are the additional costs introduced by this backup feature** for a 1 PB primary bucket (assuming both the primary bucket and backup bucket are in us-east-2). All of this information was gathered from the "Storage & requests", "Data transfer", and "Replication" tabs on [https://aws.amazon.com/s3/pricing/](https://aws.amazon.com/s3/pricing/).
>
> **Storage Costs** (backup bucket in us-east-2):
>
> - Glacier Deep Archive storage: ~$0.00099/GB/month
>   - 1 PB = 1,000 TB × $0.99/TB/month = **$990/month**
>
> **Data Transfer Costs**:
>
> - Same-region data transfer between S3 buckets is free
>
> **Retrieval Costs** (only incurred when disaster recovery is needed):
>
> - Glacier Deep Archive retrieval:
>   - $0.02/GB (standard, 12-hour retrieval)
>   - $0.0025/GB (bulk retrieval, can take up to 48 hours)
>
> Imagining that the entire primary bucket was destroyed (which is not the expected scale of data loss, but useful as a worst-case analysis), the cost to restore from backup would be
>
> $`1\ \rm{PB} \times \frac{1000\ \rm{TB}}{\rm{PB}} \times \frac{1000\ \rm{GB}}{\rm{TB}} \times \frac{\$0.0025}{\rm{GB}} = \$2500`$.
>
> ### Future Costs
>
> The DANDI Archive is expecting a ramp-up in data volume of 1 PB of new data over each of the next five years, culminating in a total of 6 PB.
>
> Scaling up the previous analysis means that the monthly costs are projected to rise to a total of **~$5,900/month** once all of that data is in place.
> The worst-case disaster recovery cost would similarly scale up to a total of **~$16,000**.
>
> An open question is whether the AWS Open Data Sponsorship program would cover the marginal costs of backup. A quick estimate shows that once all 5 PB of new data has been uploaded (6 PB in total), the expected bucket cost for the primary bucket (i.e., what the AWS Open Data Sponsorship program covers already, excluding backup) will be:
>
> $$
> 6\ \rm{PB} \times \frac{1000\ \rm{TB}}{\rm{PB}} \times \frac{1000\ \rm{GB}}{\rm{TB}} \times \frac{\$0.021/\rm{mo}}{\rm{GB}} \approxeq \$126{,}000/\rm{mo}
> $$
>
> while the associated backup costs would represent only an additional $`\$5900 / \$126000 \approxeq 4.7\%`$ of the cost of the storage itself.
> To help provide a significant level of safety to an important dataset, AWS may be willing to cover such a low marginal cost.
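For reference, the arithmetic in the excerpt above can be reproduced in a few lines. This is a minimal sketch for checking the figures, not part of the design doc itself; the per-GB rates are simply copied from the excerpt and should be re-verified against the AWS pricing page before relying on them.

```python
# Rough cost arithmetic for the proposed backup, using the rates quoted above.
# The rates are assumptions copied from the excerpt; re-check them against
# https://aws.amazon.com/s3/pricing/ before relying on these numbers.

GB_PER_PB = 1000 * 1000          # AWS pricing uses decimal units (10^9 bytes per GB)

DEEP_ARCHIVE_RATE = 0.00099      # $/GB/month, Glacier Deep Archive storage
BULK_RETRIEVAL_RATE = 0.0025     # $/GB, one-time bulk retrieval
PRIMARY_STORAGE_RATE = 0.021     # $/GB/month, primary bucket storage

def monthly_backup_cost(pb: float) -> float:
    """Monthly Glacier Deep Archive storage cost for `pb` petabytes."""
    return pb * GB_PER_PB * DEEP_ARCHIVE_RATE

def worst_case_restore_cost(pb: float) -> float:
    """One-time bulk-retrieval cost to restore `pb` petabytes."""
    return pb * GB_PER_PB * BULK_RETRIEVAL_RATE

print(monthly_backup_cost(1))        # 990.0   -> ~$990/month for a 1 PB backup
print(worst_case_restore_cost(1))    # 2500.0  -> $2,500 to restore 1 PB
print(monthly_backup_cost(6))        # 5940.0  -> ~$5,900/month at 6 PB
print(worst_case_restore_cost(6))    # 15000.0 -> the doc rounds this up to ~$16,000

primary = 6 * GB_PER_PB * PRIMARY_STORAGE_RATE   # ~$126,000/month primary bucket
print(monthly_backup_cost(6) / primary)          # ~0.047, i.e. ~4.7% marginal cost
```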
@satra The design here looks good, but the main open question and blocker is cost.
Would the AWS Open Data program be open to covering the backup storage costs as well (~$12,000/year for a 1 PB backup, and ~$72,000/year for the projected 6 PB backup)? See the doc for more details.
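(As a quick check of those annual figures, assuming the ~$990/PB/month Deep Archive rate quoted above: $`\$990/\rm{mo} \times 12 \approxeq \$11{,}900/\rm{yr}`$ for a 1 PB backup and $`6 \times \$990/\rm{mo} \times 12 \approxeq \$71{,}300/\rm{yr}`$ for 6 PB, roughly matching the rounded numbers here.)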
For the best comparison, see the table here: https://github.com/CodyCBakerPhD/dandi-archive/blob/f9a19031a1a40f5fa71cfdaf42a7b6bd61e934dd/doc/design/s3-backup-nese.md#future-costs
They won't cover backup. We can cover the Glacier Deep Archive equivalent to start with and then figure things out from there.
Might I suggest that, if we move forward with the S3 Replication + Glacier Deep Archive strategy, we start off by testing it with only a certain percentage of assets while closely monitoring costs to ensure our predictions are accurate?
(And while we're at it, attempt some limited replication tests to make sure we understand exactly how that process works and that it behaves as expected, and that restoration pricing also meets predictions; see the sketch below.)
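To make this concrete, below is a minimal, hypothetical sketch of what such a pilot could look like with boto3: a replication rule limited to a single prefix that delivers replicas to the backup bucket in the `DEEP_ARCHIVE` storage class, plus a bulk restore request for spot-checking retrieval behavior and pricing. The bucket names, IAM role ARN, prefix, and object key are placeholders rather than the archive's real resources, and S3 Replication additionally requires versioning to be enabled on both buckets.

```python
# Hypothetical sketch of a partial-rollout replication rule and a test restore.
# All bucket names, ARNs, prefixes, and keys below are placeholders.
import boto3

s3 = boto3.client("s3")

# Replicate only objects under a pilot prefix into the backup bucket,
# storing the replicas directly in Glacier Deep Archive.
# (Versioning must already be enabled on both buckets.)
s3.put_bucket_replication(
    Bucket="dandi-primary-bucket-placeholder",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/replication-role-placeholder",
        "Rules": [
            {
                "ID": "backup-pilot",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": "blobs/00/"},  # pilot subset only
                "Destination": {
                    "Bucket": "arn:aws:s3:::dandi-backup-bucket-placeholder",
                    "StorageClass": "DEEP_ARCHIVE",
                },
                "DeleteMarkerReplication": {"Status": "Disabled"},
            }
        ],
    },
)

# Later, spot-check restoration (and its pricing) with a bulk retrieval
# of a single replicated object from the backup bucket.
s3.restore_object(
    Bucket="dandi-backup-bucket-placeholder",
    Key="blobs/00/example-object-placeholder",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
)
```

The pilot's actual spend could then be watched with the usual billing tools (e.g., Cost Explorer) and compared against the estimates in the doc before widening the rule to the full bucket.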
This PR lays out a design for S3 backup using S3 Replication and the Glacier Deep Archive storage class. Related: #524