>>25681
I don't think it's that much content.
The biggest problem would be managing some kind of database or CMS collaboratively to create a version of the data that can be considered "complete".
We build a list of creators and their media outlets: paysites with subscriptions and pay-per-view, artists posting on multiple sites (DA, FA, Twitter, etc.), models uploading videos to multiple sites (Pornhub, Twitter, OnlyFans, etc.).
Create a list of their content and update it when possible, or at least write down when each piece was created; maybe automate some crawling.
Even if the content isn't free, many creators will release a blurry preview or the title of what they created to prove that they are active, you can still view the file list of a defunct torrent, etc.
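Rough sketch of what rows in such a database could look like; all the field names here are made up, just to show that even a thin schema can record "we know this exists but don't have the file yet":

```python
from dataclasses import dataclass, field

@dataclass
class Outlet:
    """One place a creator publishes, e.g. a Twitter account or a paysite."""
    site: str                 # "twitter", "onlyfans", ...
    url: str
    paywalled: bool = False

@dataclass
class Creator:
    name: str
    outlets: list = field(default_factory=list)

@dataclass
class ContentItem:
    """One known piece of content, even if we only know its title or preview."""
    creator: str
    title: str
    created: str              # ISO date if known, "" if not
    have_full_file: bool = False   # False = only a preview/thumbnail/title so far
    sources: list = field(default_factory=list)  # every URL it was seen at
```

The point is that an entry exists as soon as *anything* is known (a blurry preview, a torrent file list), and gets upgraded later when a real copy turns up.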
Based on metadata we can tell when postings are the same thing, and how important each copy is:
* Artist posts new images on all sites they are active on → we only keep the version with the best quality, and note that there are multiple sources for the same image.
* Model posts a preview GIF while the full video is on a pay-per-view site → we keep the low-quality version until we might get the original.
* We only have the thumbnail of a video that's locked behind a subscription → we keep the thumbnail as a placeholder, and we can use it plus some metadata to identify the video in an older siterip.
* Photo set based on a video → keep both, but link them together as parts of the same "thing".
* Creator reposts their own stuff → ignore, maybe check whether it differs from the original version.
* Multiple versions of the same image → keep all, ignore the versions that don't add any information.
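The "keep the best copy, note the rest" rule from the list above could be as simple as this (the `url`/`quality` keys are a made-up schema, quality being e.g. pixel count or bitrate):

```python
def pick_canonical(entries):
    """Keep the highest-quality copy, record the others as alternate sources.

    `entries` is a list of dicts with at least 'url' and 'quality',
    all assumed to be copies of the same underlying work.
    """
    best = max(entries, key=lambda e: e["quality"])
    alternates = [e["url"] for e in entries if e is not best]
    return {"canonical": best, "also_posted_at": alternates}
```

Keeping the alternate URLs matters: if the "best" copy later turns out to be watermarked or re-encoded, the other sources are still on record.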
Download the stuff, create file hashes to verify that we are talking about the same bytes.
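The hashing part is standard library stuff; chunked reading so multi-GB videos don't have to fit in RAM:

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two downloads with the same digest are byte-identical, so "is your copy the same as mine" becomes a one-line comparison.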
Remove DRM watermarks by comparing different versions.
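One way the "compare different versions" idea can work, sketched on plain grayscale pixel lists (a real tool would operate on decoded image data and handle alignment/re-encoding first): if each copy carries its watermark in a different spot, a per-pixel median votes the watermark out.

```python
from statistics import median

def merge_versions(versions):
    """Per-pixel median across several copies of the same image.

    `versions` is a list of equal-length pixel sequences. A watermark that
    appears in only a minority of copies at any given pixel gets outvoted
    by the clean majority value there.
    """
    return [int(median(pixels)) for pixels in zip(*versions)]
```

This only helps when the watermarks don't overlap in most copies; identically-placed watermarks survive the vote.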
Add tags.