>>50358
I swear, you would think that first sentence you wrote was true, but unfortunately it's not. I worked with a place that had a few hundred TB worth of files, all of it valuable, most of it irreplaceable, and their idea of a "backup" was effectively "what's that?" Now this place was run by a very "special" CEO whose only real skill was getting employees, contractors, and even interns to ask "how the hell are we still in business," so I'll admit it's not what you'll typically see, but regrettably, it does exist. Under normal circumstances I would have recommended terminating the sysadmin (or sysadmins) responsible for this mess, but it didn't take long for me to realize that the sysadmin had no power whatsoever and was basically at the mercy of wildly incompetent management. Requests for everything from a backup server to offsite backups were typically laughed off as soon as budgets were put forward because of how cheap this place was. Don't even get me started on how they handled material that should have been encrypted and stored either offline or on an air-gapped server.
Also, I should probably clarify what I mean when I refer to a "NAS." I'm not referring to data center level equipment, but rather to the cheap shit sold by companies like Synology, QNAP, Ugreen, and formerly Drobo--stuff that stupid people buy because they think it's affordable, which it is until they lose all of their data. I'll typically just use the term "server" to distinguish data center level gear from the aforementioned NAS systems.
Consider yourself lucky that you've never had a backplane go out. I've seen a few fail over the years, and they're never a fun replacement. Power supplies and drives definitely go far more frequently though. (Drive failure? Must be a Thursday again.) Actually, now that I think about it, I've probably seen almost every server component imaginable fail at least once. ECC RAM? Check. NIC? Check. RAID card and/or port expanders? Pretty often. Backplane? Like I said, it's not common, but I've seen it more than I've seen high quality USB enclosures fail. Power supplies/drives? Well duh, that's a common occurrence. CMOS battery? Not as common but yeah, dealt with that on a few occasions. Really rare failures like a power switch or a VGA port? Yep, dealt with those too. (Absolutely hated dealing with them too, because they would only ever be discovered when there was already something else wrong with the server.) Motherboards and CPUs? Yep, they fall somewhere between power supplies/drives and RAID cards/port expanders in the frequency of failures.
Obviously if you have access to new data center level equipment, then by all means use it, but I wrote my original recommendation for people who typically don't have the budget/experience for working with data center level gear, or who make the mistake of mixing data center gear with inferior "gaming PC" equipment. This is cute until it fails spectacularly because the stuff designed for gaming PCs isn't built to the same standard as the stuff designed for data centers, despite what manufacturers might claim.
FWIW, I've typically found that people (not corporate clients) tend to divide into four or five groups when it comes to backups/data archival:
1. Those who have no backups and have never bothered to even consider them. They used to wield laptops, now they typically wield phones.
2. Those who only want/have cloud-based backups. This is cute until their gadget of choice fails and they're spending days waiting to recover their data because they don't have a fiber line and half of them don't even know what an ethernet cable is, let alone why they should be plugging one in and using it instead of WiFi. (This is also fun when the backup has never been tested, is more corrupt than not because it was made over the span of days on a shoddy WiFi connection, and -insert priceless file here- is lost as a result.)
3. Those who only want/have local, onsite backups. These people are just as bad as the Cloud-only people. Occasionally they'll have more than one drive's worth of data, but usually all of their stuff is stored on a single external HDD, and for some reason it's usually a 2.5" drive made by Seagate with an absurdly high failure rate. When it's not a single external HDD, it's a single PC with "redundancy" that completely ignores the fact that RAID/redundancy is not a backup.
4. Datahoarders who have some form of NAS or server holding "hot" data, offline external drive based backups of the NAS/server holding "cold" data, and a cloud-based backup of the same data for added redundancy.
5. People who are sort of the "not datahoarder" subset of the group above. They'll have redundant backups, they may even have geographical backups, but they're typically only using two out of three of the systems the people in group four are using. (That is, they may have external drives and cloud backups but no NAS, or a NAS and external drives, but no cloud backups, or any other permutation thereof.)
I may have to give czkawka a try just so I have an alternative in case my current duplicate finder ever disappears or nosedives in quality for some reason. (Realistically it shouldn't, it's widely used and has been around for almost three decades, but that doesn't guarantee that things won't change one day.) This should be obvious, but I bought my current duplicate finder to inspect all of my files, not just my porn. I often deal with a high quantity of extremely large video files, and I need to know that the copies are absolute bit-level duplicates, not similar files, not almost bit-level duplicates, but 100% identical bit-level duplicates. (The tool I'm using actually has some advantages over traditional methods like comparing MD5 checksums in terms of speed/accuracy too, which is one of the main reasons I like it.) These files typically are part of projects with a fast turnaround time so there's no time to make proxies of them, much less wait days for them to upload to cloud storage where they wouldn't be editable anyway. They have to be dealt with locally and dealt with fast. Once the projects are completed they get moved into "offline" storage which gets backed up physically and geographically. If I find myself having to access one of these projects at least once a year, then it winds up also going into cloud storage, and while I don't list it as a line item, I add the cost of that cloud storage to the client's overall bill for the project.
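For anyone who wants a free fallback for the "100% identical, bit for bit" check, here's a rough sketch using nothing but the Python standard library. To be clear, this is not how my duplicate finder works internally (I have no idea what it does under the hood), and `find_exact_duplicates` is just a name I made up. The idea is to bucket files by size first, since files of different sizes can't be identical, and then do a real byte-by-byte comparison inside each bucket instead of trusting a checksum.

```python
import filecmp
import os
from collections import defaultdict


def find_exact_duplicates(root):
    """Return sorted lists of paths under `root` whose contents are byte-identical."""
    # Files of different sizes can never be identical, so bucket by size first.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        # Within one size bucket, build clusters of mutually identical files.
        clusters = []
        for path in paths:
            for cluster in clusters:
                # shallow=False forces an actual comparison of file contents.
                if filecmp.cmp(cluster[0], path, shallow=False):
                    cluster.append(path)
                    break
            else:
                clusters.append([path])
        groups.extend(sorted(c) for c in clusters if len(c) > 1)
    return groups
```

Calling `find_exact_duplicates("/path/to/projects")` returns one list per set of identical files. It will be slow when a bucket holds many huge files of the same size, so treat it as a sanity check rather than a replacement for a proper tool.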
Anyway, I'm pretty sure the main point of this thread wasn't actually about storing our collections as we've been discussing, but rather about building some sort of database/archive/index of diaper porn. If I'm indeed understanding OP's question correctly, then the second paragraph of
>>50299 and your czkawka recommendation are still the most relevant portions of this thread. Presumably the OP's goal isn't to host content, but rather to serve as "the Mr. Skin of Diaper Porn" and the "Google of Diaper Porn" if you will. What the OP needs is some sort of UUID for every piece of diaper porn in existence (or at least every one he could reasonably track), and a way to link that UUID to the various sites that hold that content.
>>25688 has lots of good suggestions as well.
The biggest issue I see with 25688's suggestion for creating file hashes to verify that the files being linked to a piece of content's UUID are indeed the same is the fact that most of the paysites have very slight differences in their files, which are just different enough to result in a different set of hashes/checksums. If you run PSNR/SSIM on the files you'll wind up with a PSNR of "infinity" (and an SSIM of 1.0), indicating that the actual data is the same, but something inside of the container is somehow different. And by the way, this ignores, for the moment, models/studios who have the same clip available in different file formats, resolutions, and quality options on multiple sites, and only focuses on files where there's supposedly "one version to rule them all" on every site.
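One possible way around the container problem is to hash the streams instead of the files. Here's a hedged sketch that shells out to ffmpeg's `md5` muxer, assuming ffmpeg is installed and on the PATH. The function names are mine, not part of any library. With `decode=False` it hashes the encoded packets, which is fast but may still differ when two containers frame the same stream differently; with `decode=True` it hashes the decoded frames, which is slow but is the closer equivalent of a PSNR of "infinity."

```python
import subprocess


def stream_md5_command(path, stream="v:0", decode=False):
    """Build an ffmpeg command that hashes one stream and ignores the container."""
    cmd = ["ffmpeg", "-v", "error", "-i", path, "-map", f"0:{stream}"]
    if not decode:
        # Hash the encoded packets as they are (no decoding, no re-encoding).
        cmd += ["-c", "copy"]
    # The md5 muxer writes a single "MD5=<hex>" line; "-" sends it to stdout.
    cmd += ["-f", "md5", "-"]
    return cmd


def parse_md5(output):
    """Extract the hex digest from the md5 muxer's 'MD5=<hex>' output."""
    for line in output.splitlines():
        if line.startswith("MD5="):
            return line[len("MD5="):].strip()
    raise ValueError("no MD5 line found in ffmpeg output")


def stream_md5(path, stream="v:0", decode=False):
    """Run ffmpeg and return the digest of the selected stream."""
    result = subprocess.run(
        stream_md5_command(path, stream, decode),
        capture_output=True, text=True, check=True,
    )
    return parse_md5(result.stdout)


def same_stream(path_a, path_b, stream="v:0", decode=False):
    """True when both files carry an identical copy of the selected stream."""
    return stream_md5(path_a, stream, decode) == stream_md5(path_b, stream, decode)
```

For example, `same_stream("c4s_copy.mp4", "mv_copy.mp4", decode=True)` (hypothetical file names) would compare the decoded video of two downloads, and passing `stream="a:0"` would do the same for the first audio track. A matching digest is strong evidence the content is the same; a mismatch with `decode=False` doesn't prove it's different.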
For example, let's pretend that a model named "Hot Diapers" exists and she puts out a video named "Hot Diapers Pisses in Sexy Diapers." We can give our hypothetical model "Hot Diapers" a UUID to identify her as a model, that's easy. We can give our hypothetical video "Hot Diapers Pisses in Sexy Diapers" a UUID that links it to the "Hot Diapers" model's UUID, and everything is still good. Where we run into an issue is with identifying the best quality version of the video, which is the one that people would probably want. We could list that this hypothetical model's hypothetical video is on C4S, JFF, and MV, possibly list other sites it pops up on as well, but recommending the "best quality" version is going to be the tough part. We could rule out JFF based on their mandatory watermarking alone, and probably rule out most of the other sites the thing pops up on. The problem is going to be confirming that the version on C4S and the version on MV are the same video, especially when ABDL models will give their videos just slightly different names depending on the site they're uploading to, and when there's some container-level difference that requires PSNR/SSIM scans to verify that the actual audio and video content is the same in both files despite mismatched checksums. Oh, and just for fun, since OP mentioned "maybe experiment with AI upscaling," let's pretend that some horny bastard took the original file of "Hot Diapers Pisses in Sexy Diapers," upscaled it from 4K to 8K (or 1080p to 4K), and now there's an upscaled version floating around somewhere that looks better than the original. Do we now list this as the "best quality" file and denote the original version as "original" with a link to the original page on C4S/MV, or do we list the "original" file as "best quality" and then the better looking upscaled file as "AI Upscaled"?
Do we treat AI upscaled files differently than older vids that have downscaled releases in 1080p, 720p, and 480p, or even older releases that came in 480p and 240p varieties?
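On the "slightly different names depending on the site" problem, a cheap first pass could be fuzzy title matching. This is only a sketch built on the standard library's `difflib`; the function names, the list of resolution tags, and the 0.85 threshold are all assumptions I picked for illustration, not tuned values. It can only nominate candidate matches, so the content itself would still have to be compared before two listings get linked to the same UUID.

```python
import re
from difflib import SequenceMatcher

# Resolution/quality tags that sites tend to tack onto titles (assumed list).
_TAGS = re.compile(r"\b(4k|8k|uhd|hd|1080p|720p|480p)\b")


def normalize_title(title):
    """Lowercase, drop quality tags and punctuation, collapse whitespace."""
    title = title.lower()
    title = _TAGS.sub(" ", title)
    title = re.sub(r"[^a-z0-9]+", " ", title)
    return " ".join(title.split())


def title_similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 for two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def probably_same_release(a, b, threshold=0.85):
    """True when two titles are close enough to be worth a content check."""
    return title_similarity(a, b) >= threshold
```

So `probably_same_release("Model Name - Clip Title (4K)", "model name: clip title")` comes back `True`, because both titles normalize to the same string, while two unrelated titles fall well under the threshold.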
Also, how do we deal with creators who are no longer active, made some really hot stuff, but have now nuked their C4S/JFF/MV/etc. pages? Do we list them and instead of having a set of "Get it on C4S/JFF/MV/etc." links simply list it as "unavailable," or do we try to at least provide some sort of link to someone who might still be hosting it somewhere? (E.g. "No longer available from C4S/JFF/MV/etc., get it from torrent link/file sharing site link/etc., or stream it on -insert porn streaming site here-" instead.) I'd be in favor of the latter, but that would assume that someone still had the content in question, and that it was accessible without jumping through hoops.
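To make the UUID idea a bit more concrete, here's a minimal sketch of what the records might look like, using Python dataclasses. Every class, field, and label here is hypothetical. It also bakes in one possible answer to the questions above (AI upscales get their own label and never count as "best quality," watermarked copies are skipped, and a listing with no URL means the page is gone), which is an assumption on my part and not the only reasonable policy.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Listing:
    site: str               # e.g. "C4S", "MV", or a mirror
    url: Optional[str]      # None when the page no longer exists
    height: int             # vertical resolution in pixels
    kind: str = "original"  # "original", "downscaled", or "ai_upscaled"
    watermarked: bool = False


@dataclass
class Video:
    title: str
    model_id: uuid.UUID     # UUID of the model this video belongs to
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    listings: List[Listing] = field(default_factory=list)

    def available(self):
        """Listings that still have a working location."""
        return [l for l in self.listings if l.url is not None]

    def best_quality(self):
        """Highest-resolution clean, non-upscaled listing, or None."""
        candidates = [
            l for l in self.available()
            if not l.watermarked and l.kind != "ai_upscaled"
        ]
        return max(candidates, key=lambda l: l.height, default=None)

    def status(self):
        """'available' if anything is still reachable, else 'unavailable'."""
        return "available" if self.available() else "unavailable"
```

With this layout, a video whose listings have all lost their URLs reports `"unavailable"` but keeps its UUID, so a mirror link can be attached later as just another `Listing` if someone turns out to still have the file.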