## Small Files Archiving Solution

This solution consists of two scripts: one combines many small files into big TAR files, and the other searches for and retrieves a specific file from a TAR archive. It helps users reduce cloud storage cost by cutting PUT request cost, and transfers data to the cloud faster than transferring the original files one by one.

### Features

- Generating tarfiles and uploading them to S3 directly
- Generating tarfiles and saving them to a filesystem directory
- Providing manifest files that include tarfile name, subset file name, date, file size, first block, and last block
- Finding the tarfile that includes a specific subset file by condition, such as filename, date, or duration
- Retrieving a subset file itself from a tarfile in S3 using [byte-range](https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html) fetches

## Prerequisites

- python >= 3.7
- linux
- [aws cli](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)

I'll show you how to use this solution.

## Discovering source directory and file structure

In a manufacturing environment, product image files are typically stored in local storage under a directory structure such as "Line/Equipment/Lot/Date". Dividing files by date is especially effective for searching a specific object with an index.

* Directory structure example: /mnt/1Line/1E/1L/2022/08/01

## Combining small files into a big file

Users can combine files into big files with the [s3archiver.py](https://github.com/aws-samples/small-files-archiving-solution/blob/main/s3archiver.py) script. It scans the source directory and generates 10GB TAR files by default, or it can generate TAR files based on the number of source files. It also generates one manifest file per TAR file; each manifest contains the tarfile name, subset filename, date (YYYY|MM|DD), file size, start block, and last block, which can be used to download only a specific subset file rather than the entire tarfile. Below is how to run the script, and I will explain each parameter.
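Before looking at the parameters, it can help to know roughly how many files and how many bytes sit under the source directory, since that guides the choice between combining by size and combining by count. Below is a minimal sketch using only the Python standard library; the path is the example path above and should be replaced with your own source directory.

```python
# Quick survey of a source tree: file count and total size in bytes.
# The path below is the example path from this README -- replace it with
# your own source directory.
import os

src_dir = "/mnt/1Line/1E/1L/2022/08/01"

file_count = 0
total_bytes = 0
for root, _dirs, files in os.walk(src_dir):
    for name in files:
        file_count += 1
        total_bytes += os.path.getsize(os.path.join(root, name))

print(f"files: {file_count:,d}, total size: {total_bytes:,d} bytes")
# Rough estimate of how many 10GB tarfiles a '--combine size' run would produce.
print(f"approx. 10GB tarfiles: {total_bytes // (10 * 1024**3) + 1}")
```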
### Parameters of s3archiver.py

```bash
usage: s3archiver.py [-h] --src_dir SRC_DIR --protocol PROTOCOL
                     [--prefix_root PREFIX_ROOT] [--max_process MAX_PROCESS]
                     --combine COMBINE [--max_file_number MAX_FILE_NUMBER]
                     [--max_tarfile_size MAX_TARFILE_SIZE]
                     [--bucket_name BUCKET_NAME] [--endpoint ENDPOINT]
                     [--profile_name PROFILE_NAME]
                     [--storage_class STORAGE_CLASS]
                     [--bucket_prefix BUCKET_PREFIX] [--fs_dir FS_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --src_dir SRC_DIR     source directory e) /data/dir1/
  --protocol PROTOCOL   specify the protocol to use, s3 or fs
  --prefix_root PREFIX_ROOT
                        prefix root e) dir1/
  --max_process MAX_PROCESS
                        NUM e) 5
  --combine COMBINE     size | count, if you combine files based on tarfile
                        size, select 'size', or if you combine files based on
                        file count, select 'count'
  --max_file_number MAX_FILE_NUMBER
                        max files in one tarfile
  --max_tarfile_size MAX_TARFILE_SIZE
                        NUM bytes e) $((1*(1024**3))) #1GB for < total 50GB,
                        10GB for > total 50GB
  --bucket_name BUCKET_NAME
                        your bucket name e) your-bucket
  --endpoint ENDPOINT   snowball endpoint e) http://10.10.10.10:8080 or
                        https://s3.ap-northeast-2.amazonaws.com
  --profile_name PROFILE_NAME
                        aws_profile_name e) sbe1
  --storage_class STORAGE_CLASS
                        specify S3 classes, be cautious Snowball supports only
                        the STANDARD class; StorageClass=STANDARD|REDUCED_REDUNDANCY|
                        STANDARD_IA|ONEZONE_IA|INTELLIGENT_TIERING|GLACIER|
                        DEEP_ARCHIVE|OUTPOSTS|GLACIER_IR
  --bucket_prefix BUCKET_PREFIX
                        prefix of object in the bucket
  --fs_dir FS_DIR       specify fs mounting point when protocol is fs
```

### Generating tarfiles and uploading to S3 directly

You can run [s3archiver.py](https://github.com/aws-samples/small-files-archiving-solution/blob/main/s3archiver.py) to upload tarfiles to S3 directly. In this case, you can optionally specify a bucket prefix.

```bash
## without bucket_prefix
python3 s3archiver.py --protocol s3 --src_dir '/data/nfsshare/fs1' --combine size --max_tarfile_size $((1*(1024**3))) --max_process 10 --bucket_name 'your-own-dest-repo'
```

```bash
## with bucket_prefix
python3 s3archiver.py --protocol s3 --src_dir '/data/nfsshare/fs1' --combine size --max_tarfile_size $((500*(1024**2))) --max_process 10 --bucket_name 'your-own-dest-repo' --bucket_prefix '/day1'
```

- --protocol s3: send tarfiles to S3 directly
- --src_dir: source directory of the filesystem to be archived
- --combine size|count: tarfiles can be created by total size or by number of files. If you specify **--combine size** and **--max_tarfile_size $((10*(1024**3)))**, the program creates 10GB TAR files. If you specify **--combine count** and **--max_file_number 500**, each TAR file will contain 500 original files.
- --max_process: number of concurrent jobs; the default is 10, meaning 10 processes create TAR files in parallel
- --max_tarfile_size: tarfile size in bytes; $((10*(1024**3))) means 10*(1024*1024*1024) = 10GB
- --bucket_name: S3 bucket name
- --bucket_prefix: target prefix in the S3 bucket

### Generating tarfiles and saving to filesystem directory

You can run [s3archiver.py](https://github.com/aws-samples/small-files-archiving-solution/blob/main/s3archiver.py) to save tarfiles in a filesystem. In this case, you have to specify --fs_dir to indicate the destination directory in the filesystem.

```bash
python3 s3archiver.py --protocol fs --src_dir '/data/nfsshare/fs1' --combine size --max_tarfile_size $((1*(1024**3))) --max_process 10 --fs_dir '/data2/dest'
```

- --protocol fs: save tarfiles in the filesystem
- --fs_dir: destination directory in which to store the TAR files
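Whichever protocol you choose, once a run finishes you can sanity-check what was produced. The snippet below is a minimal sketch, assuming boto3 is installed and configured with the same credentials as the AWS CLI; the bucket name is the example value used in the commands above.

```python
# List the objects (tarfiles and manifest files) stored by an s3-protocol run
# and report their total size. 'your-own-dest-repo' is the example bucket
# name used in the commands above.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_bytes = 0
for page in paginator.paginate(Bucket="your-own-dest-repo"):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        print(f"{obj['Key']:70} {obj['Size']:>15,d} bytes")

print(f"total stored: {total_bytes:,d} bytes")
```

For an fs-protocol run, a plain `ls -lhR` on the --fs_dir directory gives the same kind of overview.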
#### Running archiving script

```bash
[ec2-user@ip-172-31-42-60 ~]$ python3 s3archiver.py --protocol s3 --src_dir '/data/nfsshare/fs1' --combine size --max_tarfile_size $((1*(1024**3))) --max_process 10 --bucket_name 'your-own-dest-repo'
starting script...2023-02-14 06:00:55.726736
/src/dataset/fs1 directory is archived
archive-20230214_060055-V7N3GE.tar is combining based on count
archive-20230214_060055-PJO3S1.tar is combining based on count
archive-20230214_060055-YLVSEY.tar is combining based on count
archive-20230214_060055-Z08HBY.tar is combining based on count
archive-20230214_060055-4U86Q3.tar is combining based on count
archive-20230214_060055-MJ0OVE.tar is combining based on count
archive-20230214_060055-GBE44W.tar is combining based on count
archive-20230214_060055-WTA0PS.tar is combining based on count
archive-20230214_060055-MMRGE4.tar is combining based on count
archive-20230214_060055-LHWY1P.tar is combining based on count
archive-20230214_060055-YLVSEY.tar is archived successfully
archive-20230214_060055-8V0VSU.tar is combining based on count
archive-20230214_060055-Z08HBY.tar is archived successfully
archive-20230214_060055-LUL0IK.tar is combining based on count
archive-20230214_060055-4U86Q3.tar is archived successfully
...
...
...
```

##### Result of script

```bash
archive-20230214_055044-VOXBIY.tar is archived successfully
archive-20230214_055044-T6O20X.tar is archived successfully
archive-20230214_055044-E6KD87.tar is archived successfully
archive-20230214_055044-MR3ZC8.tar is archived successfully
/src/dataset/fs1 directory is archived
====================================
Combine: count
size or count: 500
Duration: 0:00:10.501529
Scanned file numbers: 503006
TAR files location: /dest/2023/2/14
END
====================================
```

success.log and error.log will be stored in the **logs** directory under the directory where the command was run.

### Providing manifest files

When uploading to S3, you can find the manifest files in the **lists** directory under the **logs** directory, and also in S3 under the **bucket_prefix**. When saving to a filesystem, the newly generated tarfiles and manifest files are stored in the path given by the **--fs_dir** argument; the manifest files are placed in a **lists** directory at that destination. Each TAR file has its own manifest file.

```bash
[ec2-user@ip-172-31-45-24 small-files-archiving-solution]$ ls logs/lists
archive_20230501_110237_36WP7R.tar-contents.csv  archive_20230501_110237_JO90Y7.tar-contents.csv
archive_20230501_110237_3JML96.tar-contents.csv  archive_20230501_110237_NY8J0W.tar-contents.csv
archive_20230501_110237_CKEF2M.tar-contents.csv  archive_20230501_110237_RB5MMU.tar-contents.csv
archive_20230501_110237_GQYIVS.tar-contents.csv  archive_20230501_110237_SJH2P8.tar-contents.csv
archive_20230501_110237_HOXYFB.tar-contents.csv  archive_20230501_110237_V3DO1P.tar-contents.csv
```

A manifest file describes the subset files of one tarfile. Each row contains the tarfile name, subset file name, date (year|month|day), file size (bytes), start block, and last block.
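The start and last block values come from the TAR on-disk layout: every member is preceded by a 512-byte header, and its data is padded up to a multiple of 512 bytes. The snippet below is a rough illustration only, not the solution's own manifest code; it derives comparable offsets for a local tarfile using the offset and offset_data attributes of Python's tarfile module ('archive.tar' is a placeholder name).

```python
# Rough illustration: compute start/stop byte offsets for each member of a
# local tarfile, similar in spirit to the start block / last block columns
# of the manifest. 'archive.tar' is a placeholder file name.
import tarfile

BLOCK = 512  # TAR block size

with tarfile.open("archive.tar", "r:") as tar:
    for member in tar:
        start = member.offset                       # first byte of the member's 512-byte header
        data_end = member.offset_data + member.size
        padded_end = -(-data_end // BLOCK) * BLOCK  # data padded up to a 512-byte boundary
        stop = padded_end - 1                       # inclusive last byte of the padded data
        print(f"{member.name}|{member.size}|{start}|{stop}")
```

The manifest excerpt below shows the same kind of values as produced by the solution itself.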
- example of manifest file

```bash
[ec2-user@ip-172-31-45-24 lists]$ head -n 10 archive_20230501_110237_36WP7R.tar-contents.csv
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0307|2023|5|1|21772|0|22527
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0306|2023|5|1|24485|22528|47615
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0305|2023|5|1|9846|47616|58367
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0304|2023|5|1|33002|58368|92159
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0303|2023|5|1|18425|92160|111103
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0302|2023|5|1|30775|111104|142847
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0301|2023|5|1|20421|142848|163839
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0300|2023|5|1|34504|163840|199167
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0299|2023|5|1|31042|199168|230911
archive_20230501_110237_36WP7R.tar|/data/nfsshare/fs1/d0042/dir0003/file0298|2023|5|1|19912|230912|251391
```

The start block and last block are used to download the subset file itself directly from S3.

## Transferring tarfiles into S3

You can choose among three options:

1. using s3archiver.py with the --protocol s3 option
2. using AWS DataSync
3. using AWS Storage Gateway while mounting the --fs_dir directory via NFS

## Finding tarfiles in Amazon S3

Sometimes we have to download some files from Amazon S3 to validate product status. In this case, we first have to find the tarfile that contains the specific subset files. Using Amazon Athena and the manifest files, we can find the tarfile by conditions such as filename, date, or duration.

### Creating external table on Amazon Athena with manifest files

To search for an object in the manifest files, the first step is to create an external table with an Athena query. It defines the table schema based on the contents of the manifest files. Below is a sample query to create the external table; the columns follow the manifest format (tarfile name, filename, year, month, day, size, start byte, stop byte).

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS image_archiving (
  tarname string,
  filename string,
  year int,
  month int,
  day int,
  size int,
  start_byte int,
  stop_byte int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
LOCATION 's3://your-own-dest-repo/lists/'
```

![athena-1](images/athena-1.png)

The next step is to execute a query to find the TAR file that contains the specific object.

- Query based on file name

```
select * from image_archiving where filename like '%/500000-%' limit 10;
```

- Query based on date

```
select * from image_archiving where month=09 and day=16 limit 10;
```

- Query based on duration

```
select * from image_archiving where month=09 and (day >= 16 and day <= 18) limit 10;
```

![athena-2](images/athena-2.png)

## Retrieving subset file itself from a tarfile in S3 using [byte-range](https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html)

After finding the TAR file that contains the specific object, users can retrieve that TAR file using the AWS CLI or the AWS Management Console and then extract the TAR file to get the target file.

![athena-3](images/athena-3.png)

However, if we need only a few files, downloading an entire 10GB tarfile is not efficient. In this case, we can use the byte-range feature of Amazon S3.
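As an illustration of what a byte-range fetch looks like, here is a minimal sketch assuming boto3; the bucket name, key, and start/stop bytes are the example values used elsewhere in this document and should be replaced with the values from your own manifest row.

```python
# Minimal sketch: fetch only one member's byte range from a tarfile in S3
# and extract it locally. Bucket, key, and offsets are the example values
# used elsewhere in this README -- replace them with values from your own
# manifest row.
import io
import tarfile

import boto3

bucket = "your-own-dest-repo"
key = "archive_20230501_110237_36WP7R.tar"
start_byte = 2056192   # start block from the manifest
stop_byte = 2113535    # last block from the manifest (inclusive)

s3 = boto3.client("s3")
resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start_byte}-{stop_byte}")
data = resp["Body"].read()

# The fetched range is a valid slice of the TAR archive (headers plus padded
# data blocks), so the tarfile module can list and extract its members.
with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
    print(tar.getnames())
    tar.extractall()
```

The S3 Range header ("bytes=start-stop") is inclusive on both ends, which is why the manifest records the last byte of the padded data rather than a length.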
For your convenience, I made a simple script for it: [get_tar_part.py](https://github.com/aws-samples/small-files-archiving-solution/blob/main/get_bytes_range/get_tar_part.py).

### Running get_tar_part.py

The [get_tar_part.py](https://github.com/aws-samples/small-files-archiving-solution/blob/main/get_bytes_range/get_tar_part.py) script downloads the subset file and extracts it in the current directory. You can also specify a sequential range of blocks covering multiple files.

```
python3 get_tar_part.py --bucket_name 'your-own-dest-repo' --key_name 'archive_20230501_110237_36WP7R.tar' --start_byte '2056192' --stop_byte '2113535'
```

- --bucket_name: bucket name that stores the tarfile
- --key_name: tarfile name found by the previous Athena query
- --start_byte: the subset file's start block from the manifest file
- --stop_byte: the subset file's last block from the manifest file

#### Result of script

When the script finishes, we can see the extracted file names.

```bash
[ec2-user@ip-172-31-45-24 get_bytes_range]$ sh run_get_tar_part.sh
['data/nfsshare/fs1/d0042/dir0003/file0216', 'data/nfsshare/fs1/d0042/dir0003/file0215', 'data/nfsshare/fs1/d0042/dir0003/file0214', 'data/nfsshare/fs1/d0042/dir0003/file0213']
```

Here is an example of the extracted files.

```bash
[ec2-user@ip-172-31-45-24 get_bytes_range]$ ls -l data/nfsshare/fs1/d0042/dir0003/
total 60
-rw-rw-r-- 1 ec2-user ec2-user  7697 Dec 20  2019 file0213
-rw-rw-r-- 1 ec2-user ec2-user 13637 Dec 20  2019 file0214
-rw-rw-r-- 1 ec2-user ec2-user 26441 Dec 20  2019 file0215
-rw-rw-r-- 1 ec2-user ec2-user  6346 Dec 20  2019 file0216
```

If the specified block range is not correct, the script creates a temp_tarfile to help restore part of the files. As an example, I intentionally modified --stop_byte to 2113534 even though the proper stop_byte is 2113535.

cat run_get_tar_part.sh

```
python3 get_tar_part.py --bucket_name 'your-own-dest-repo' --key_name 'archive_20230501_110237_36WP7R.tar' --start_byte '2056192' --stop_byte '2113534'
```

You will see a warning message and an incomplete tarfile.

```bash
[ec2-user@ip-172-31-45-24 get_bytes_range]$ sh run_get_tar_part.sh
unexpected end of data
Warning: Incompleted tar block is detected, but temp_tarfile is generated, you could recover some of files from temp_tarfile
temp tarfile: temp_tarfile-OTLEKX.tar
```

With the _temp_tarfile-OTLEKX.tar_ file, you can still recover some of the files.

```bash
[ec2-user@ip-172-31-45-24 get_bytes_range]$ tar tvf temp_tarfile-OTLEKX.tar
-rw-rw-r-- ec2-user/ec2-user  6346 2019-12-20 06:23 data/nfsshare/fs1/d0042/dir0003/file0216
-rw-rw-r-- ec2-user/ec2-user 26441 2019-12-20 06:23 data/nfsshare/fs1/d0042/dir0003/file0215
-rw-rw-r-- ec2-user/ec2-user 13637 2019-12-20 06:23 data/nfsshare/fs1/d0042/dir0003/file0214
-rw-rw-r-- ec2-user/ec2-user  7697 2019-12-20 06:23 data/nfsshare/fs1/d0042/dir0003/file0213
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
```

## Conclusion

The small files archiving solution provides an efficient way to archive small files on Amazon S3. Combining small files into big TAR files helps customers reduce PUT request cost and monitoring cost, and storing data in Amazon S3 Intelligent-Tiering helps save storage cost, especially for long-term archive data. With Amazon Athena, customers can search for a specific file when they need to retrieve it.

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.