NCBI Data Submit with FTP/ASCP
This post only talks about how to use the ascp
to upload your sequencing data into NCBI.
3 Different Ways of Submit Your Data
- Cloud from Amazon S3 or Google Cloud
If your data was stored in Amazon/Google Cloud at the beginning, you can easily and safely transfer them into NCBI. (I think so though I’ve never tried). - FTP or ASCP
I would recommend the second approach since our data was mostly stored in a Linux server. FTP and ASCP are very reliable. Especially forftp
, it is a very popular protocol. You could find a bunch of software like ‘FileZilla’ to upload through ftp. The best feature of ‘FileZilla’ is it supports resume interrupted transfer and lists the fail-up loaded files so you can upload them again with one click. So, no matter how many and how big the files are, it can help you upload them safely. The main limitation for FileZilla is that it can’t used in command form and so, is not suitable for the server. - Web Browser
Unless your data are very small, you would never want to try uploading them online.
When to Upload Your Data
You can upload your data whenever you want. It is better to upload your file before you start to fill the submission tables. In step 7, you could find the data/director you submitted and include them in the submission. But it seems like the NCBI would delete an inactivated file within 30 days. It is long enough to finish the submission.
I prefer to use FileZilla
to upload my data. But since now, all of my data are on the server and I don’t want to download them again, I give the ftp
and use ascp
.
How to use the ASCP
ASCP is very easy and works similarly to scp
. In the Submission home page, select My submissions → Upload via Aspera command line or FTP → Aspera command line instructions to find the instructions. It would give you the download link and key for connecting to the NCBI server. After that, use it just like the scp
.
Shortcomings or suggestions for ascp
- Keep everything in 1 director: You’d like to upload all files into one directory. Because during the submission step 7 FILES, you could only select 1 directory.
- No sub-directories: You may want to upload the directory without subdirectories. Because in the submission portal, you can’t check files in subdirectories. So, it is hard to track back which filed upload failed.
- Keep ascp log: In the submission portal, you got only first 5 lines of uploaded files. So, remember to keep the
ascp
log to record the fail uploaded data. - upload the data one by one: Strongly recommend to upload each
fq
or compressed file one by one with scripts. When you try to upload the entire directory, it may fail (I never get it done when upload a directory) - After you upload your data, you can’t see them until you go to step 7 FILES in the submission portal.
- You can’t check them immediately even from the submission portal. It takes time for them to show in the Step 7
The good thing is it is easy to write a script to upload your data automatically.
In the instructions, it suggest you to use those parameters: `-QT -l100m -k1`
-Q
: This option disables the real-time display of progress and transfer statistics during the transfer. Normally, ASCP displays ongoing statistics, such as speed and percentage of completion, but using-Q
will suppress this output.-T
: This option disables encryption of the data stream during transfer. ASCP by default uses encryption for data security, but-T
turns this off, which might improve transfer speed but at the cost of security.-l100m
: This sets the transfer speed limit to 100 megabits per second. You can adjust the value (e.g.,100m
) to control how fast the transfer is allowed to go, helping to prevent network congestion or manage bandwidth usage.-k1
: This option controls file resume behavior. The value1
means that if a transfer is interrupted, ASCP will resume from the point where it left off (resumable transfer). The other possible values for-k
are:
0
: No resume. The transfer restarts from the beginning.2
: Sparse resume. ASCP resumes only the missing parts of the file.
Personal Experience
Upload your data into a specific directory
Though we gave the argument -k1
, it could still fail. In the log, it says:
Partial Completion: 19711732K bytes transferred in 3695 seconds (43691K bits/sec), in 6 files, 4 directories; 3 files failed. Session Stop (Error: Disk write failed (server))
After you went to the step 7, you could see:
Which means 2 directories are empty. In this case, you don’t need to worry too much. You can change the code a little bit and continue to upload could solve this problem.
For example, you uploaded a directory named ALL_RNA
with code ascp -i $key_file -QT -l100m -k1 -d ALL_RNA $AddressFromInstruction
, the data in the directory ALL_RNA/SAMPLEX
was failed to upload, you can use the code ascp -i $key_file -QT -l100m -k1 -d ALL_RNA/SAMPLEX $AddressFromInstruction/ALL_RNA
to continue upload the directory SAMPLEX
into the ALL_RNA
in NCBI server
ascp -i $key_file -QT -l100m -k1 -d ALL_RNA $AddressFromInstruction ascp -i $key_file -QT -l100m -k1 -d ALL_RNA/SAMPLEX $AddressFromInstruction/ALL_RNA
Upload your date in the script
When you have lots of data, one of the convenient ways I found is we could ascp
each data independently. With a for loop, we could generate all codes into a script. When there are failed uploads, we just need to copy and paste the failed codes and run them again. Or we could also delete the code for the successfully uploaded one and run the entire script again.
|
NCBI Data Submit with FTP/ASCP