Welcome to the HCL Connections Product Ideas Lab! The place where you can submit product ideas and enhancement requests. We encourage you to participate by voting on, commenting on, and creating new ideas. All new ideas will be evaluated by the HCL Product Management & Engineering teams, and the next steps will be communicated. While not all submitted ideas will be executed upon, community feedback will play a key role in influencing which ideas are implemented and when.
For more information and upcoming events, please visit our HCL Connections page.
Here you can see an example of the "info" tab of such a file, as mentioned in my last comment:
The file was created in 2018 and multiple people are actively working on it (the last change was yesterday)
The current version is 936 (so nearly 1,000 versions so far)
The latest version is 63.4 MB
All versions together require 49.6 GB of space
I still can't find any solution for this, which surprises me, since it means that at least long-lived, frequently used files will accumulate a lot of wasted data in CNX. This may be negligible for smaller files, but we all know how tools like MS Office are (mis)used, and those files are stored as opaque binaries.
Examples of the problem
For example, our users have an extremely large Excel file of about 64 MB, which has been frequently used and updated since 2017. So there are nearly 1,000 versions of this single file, and in total all those versions require about 50 GB of storage.
This is just a simple example. We also have MS PowerPoint presentations where this is even worse, since large media files like (high-resolution) images and videos are embedded. As a result, each version of those files may be hundreds of MB. With a bunch of files each producing XX GB of versions, this adds up to hundreds of GB or even several TB (depending on size and user activity), and at least a significant part of that data is never actually used or required.
How MS Office files are handled
It's also notable that those files are handled as binary, which is part of the problem: MS Office files are compressed archives containing XML files and the embedded objects. If CNX extracted those archives and compared the content with the previous version, a lot of wasted space could be saved, since the delta would only contain the changed items (such as some text), without storing another copy of e.g. images or other objects that were already on the server in the previous version.
But that is another topic; it's just important to understand, since other tools like git deduplicate efficiently, so nobody would expect one large commit to blow up all subsequent ones.
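To illustrate the point about the file format: OOXML documents (.xlsx, .pptx, .docx) really are ordinary ZIP archives whose members are XML parts plus embedded media, so the individual parts are addressable in principle. A minimal sketch (the member names are placeholders, not a real workbook):

```python
import io
import zipfile

# Build a minimal stand-in for an OOXML file in memory. Real .xlsx files
# contain many more parts, but the structure is the same: a ZIP archive
# of XML parts plus embedded binary media.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")          # an XML part
    zf.writestr("xl/media/image1.png", b"\x89PNG...")      # an embedded object

# Each member can be listed and read individually, which is what would
# make per-part delta storage possible.
with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()

print(members)
```

A version store aware of this structure could diff the XML parts and store unchanged media members only once; CNX today treats the whole archive as one opaque blob.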
I see MS Office files as a common but special case, since not every file problem can be solved that way. Video files that are uploaded directly, for example, can't be stored more efficiently like this. So we need useful limits for file versions anyway, even if CNX handled those Office files more smartly.
Possible solutions for CNX
Imho, the main point is to have some kind of adjustable limits (I'd prefer files-config.xml). A scheduled job checks those limits regularly; if a file exceeds them, older versions are deleted until it fits the policy.
Defining such a policy is not easy. I don't want to delete things forcibly: access to file versions is very useful, and those limits shouldn't destroy data someone still needs. They should only delete very old or redundant data that is certainly no longer needed.
I'd define this as a set of conditions: maximum age, maximum number of versions, and maximum space used per file. This avoids unnecessarily deleting versions of small files that have many versions, while huge files may only keep a few versions because of their size. If those conditions are configurable, everyone can choose the settings that fit their environment best.
It would also be useful to specify a minimum number of versions to keep. This is important in all cases (including manual clean-up), because some files are archived, so nobody writes to them any more. If we simply delete versions older than, say, a year, those files won't have any versions left after that time. Users may not want that if they still need to track changes later.
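The conditions above could be expressed in files-config.xml roughly like this. This is a purely hypothetical sketch: none of these elements exist in the current files-config.xml schema, and all names are placeholders for illustration only.

```xml
<!-- Hypothetical sketch only: these elements are NOT part of the real
     files-config.xml schema; names and units are placeholders. -->
<versionRetention enabled="true">
  <maxVersionAgeDays>365</maxVersionAgeDays>        <!-- delete versions older than this -->
  <maxVersionsPerFile>200</maxVersionsPerFile>      <!-- cap on the version count per file -->
  <maxVersionSpacePerFile>10g</maxVersionSpacePerFile> <!-- cap on total version storage per file -->
  <minVersionsToKeep>20</minVersionsToKeep>         <!-- never drop below this, even for archived files -->
</versionRetention>
```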
Maybe an automated solution is not suitable for everyone, e.g. because they need to archive things. Additionally, it should be possible to clean up large files with many versions manually, using tools like wsadmin.
For example, for our Excel file with nearly 1,000 versions, we could delete all versions older than 2022-01-01. Or even better: thin them out, as backup rotation does, so that beyond a certain age only e.g. one version per week is kept. That way older versions still exist if some users need to look at them.
Delete versions older than timestamp X
Delete versions older than timestamp X; for those older versions, keep Y per week
Delete all old versions, but keep at least X versions
Delete version 300 to 600
Delete old versions to meet a maximum amount of space for all versions (e.g. we have 50 GB of versions; old versions are deleted until this is reduced to 30 GB)
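Two of the operations above can be sketched in a few lines. This is an illustrative sketch only, not CNX code: a version is modelled as a (number, created, size_bytes) tuple, and in a real deployment the data would come from the Files database via wsadmin or SQL.

```python
from datetime import datetime

def thin_older_than(versions, cutoff, keep_per_week=1):
    """Keep all versions newer than `cutoff`; of the older ones, keep at
    most `keep_per_week` per ISO calendar week (like backup rotation)."""
    keep, per_week = [], {}
    # Walk newest-first so the newest version of each week is retained.
    for num, created, size in sorted(versions, key=lambda v: v[1], reverse=True):
        if created >= cutoff:
            keep.append((num, created, size))
            continue
        week = created.isocalendar()[:2]          # (ISO year, ISO week number)
        if per_week.get(week, 0) < keep_per_week:
            per_week[week] = per_week.get(week, 0) + 1
            keep.append((num, created, size))
    return sorted(keep, key=lambda v: v[0])

def trim_to_space(versions, max_bytes, min_keep=1):
    """Delete the oldest versions until the total size fits `max_bytes`,
    but always retain at least `min_keep` of the newest versions."""
    keep = sorted(versions, key=lambda v: v[0])
    while len(keep) > min_keep and sum(v[2] for v in keep) > max_bytes:
        keep.pop(0)                               # drop the oldest version
    return keep
```

Combined with a configurable minimum number of versions to keep, these rules would cover both the "delete older than X" and the "reduce to N GB" cases listed above.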
Getting an overview
As an administrator, it's important to see how big this issue is in my environment. So we should have a dashboard that lists files by their total space (all versions), their version count, and the dates of the first and last version. Ideally, versions could be deleted directly there by condition, but at minimum it should provide an overview.
Currently, this is only possible by reverse-engineering the database and writing SQL queries, especially against the FILES.MEDIA and FILES.MEDIA_REVISION tables. That shouldn't be required for such a basic feature.
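The aggregation such a dashboard needs is simple. A minimal sketch, assuming the per-version metadata has already been fetched (however that is done, e.g. via SQL against the Files schema); the row format is hypothetical:

```python
from datetime import datetime

def build_overview(version_rows):
    """Aggregate version rows into a per-file overview.

    version_rows: iterable of (file_id, created, size_bytes) tuples.
    Returns (file_id, stats) pairs sorted by total version space, descending,
    i.e. the biggest space consumers first.
    """
    files = {}
    for file_id, created, size in version_rows:
        entry = files.setdefault(file_id, {"versions": 0, "total_bytes": 0,
                                           "first": created, "last": created})
        entry["versions"] += 1
        entry["total_bytes"] += size
        entry["first"] = min(entry["first"], created)   # date of first version
        entry["last"] = max(entry["last"], created)     # date of last version
    return sorted(files.items(), key=lambda kv: kv[1]["total_bytes"], reverse=True)
```

This yields exactly the columns described above: total space across all versions, version count, and first/last version dates per file.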
Additionally, it would be even better if end users could manage the versions of specific files themselves. This isn't relevant for everyone, but it is for people working on large files with many versions.
Currently, we can't easily navigate to specific versions: the web UI shows the latest 10 versions with a "load more" button. In my 1,000-version example, the user has to click this link 100 times (!) just to see the oldest version entry, and would then have to delete every single version by hand.
This is not practical. It would be better to have pagination here, so we could jump straight to the last page. And for batch deletion, users need tools like those described above, e.g. to delete versions older than a specific date, or versions 300 to 600.
I agree with that.
I think it could be a general parameter per company, per user, or per community.