You can transfer data between your computer and our storage systems in several ways: mount Bucket as a remote folder on your desktop; use ssh to copy data from the terminal; use Datashare to quickly transfer data to other users at OIST; or use rsync for fast, reliable transfer of large data sets.
- Access Bucket as a shared remote folder from your desktop
- Use ssh and sftp to copy data on the command line
- Quickly transfer data to other users with Datashare
- Fast, reliable transfer using Rsync
- Mount a remote directory using sshfs
You can add Bucket to your desktop as a remote folder. The details depend on your operating system, but you mount it as a shared drive using the SMB protocol.
| File system | Server | Domain | Share name |
|---|---|---|---|
| bucket | bucket.oist.jp | OIST | bucket |
Please check the IT help pages on the external IT site for more information on how to mount remote shared folders on your operating system.
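As a rough sketch, on a Linux machine with the cifs-utils package installed you could also mount the share manually from the command line. The mount point /mnt/bucket and the user ID below are placeholders:

$ sudo mkdir -p /mnt/bucket
$ sudo mount -t cifs //bucket.oist.jp/bucket /mnt/bucket -o username=your-oist-id,domain=OIST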
The “scp” command, for “Secure CoPy”, is the main way to copy
files to and from Deigo. You tell it what to copy, then where to copy it to.
For example: If you want to copy a file myfile.txt from your own computer
to your home folder on Deigo, you would open a terminal (or a MobaXTerm window)
on your own computer and do:
$ scp myfile.txt oist-id@deigo.oist.jp:
The scp command copies files not just between directories, but between
computers. To specify the remote machine, the source or destination is split
into two parts by a colon (“:”): before the colon is the name of the
remote computer, and after it comes the path on that computer. Here is the full
pattern:
$ scp user-id@source-computer:/path/to/file user-id@destination:/path/to/folder
Each part follows the same form: your user ID and the address of the
computer, then a colon “:”, and then the path on that computer.
If you leave out the path after the colon, scp assumes you mean your home
directory on that machine. So our scp command above will copy “myfile.txt” to
our home directory on Deigo.
If you leave out the remote computer and the colon, scp will assume you
specify a path on the local computer. If you accidentally forget the colon and
run:
$ scp myfile.txt oist-id@deigo.oist.jp
scp will copy “myfile.txt” into a local file in the current directory named
“oist-id@deigo.oist.jp”, which is probably not what you wanted.
If you set up an ssh configuration file as described in “Connect to the Clusters”, you can use the short names here too:
$ scp myfile.txt deigo:
The default remote target directory is your home. If you add a path after the colon, that’s where the file will go:
$ scp myfile.txt deigo:/bucket/MyUnit/
Everything before the colon in deigo: specifies the remote computer.
Everything after the colon is the directory or file on the remote
machine. You can use shell wildcards to specify multiple files.
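For example, assuming you have a few “.txt” files in your current local directory, this copies all of them to your home on Deigo:

$ scp *.txt deigo: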
You can do the same when copying data to and from folders in your home, the Flash and Work file systems, and so on.
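For instance, assuming your unit has a directory under /flash, you could send an archive there like this (“MyUnit” is a placeholder for your own unit's directory):

$ scp results.tar.gz deigo:/flash/MyUnit/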
To copy files from Deigo to your local computer, you need to start on your local machine, not Deigo¹; and you specify the remote source on Deigo and the local destination on your computer:
$ scp deigo:/bucket/UnitU/my_dir/* .
This would copy all the files in directory my_dir in Bucket on Deigo to the
current directory on your local computer.
Like most commands, scp by default only works on individual files. But if you
add the -r option, you can copy recursively, that is, copy all files in all
subdirectories. The below command will copy the my_dir directory and all
files and directories inside.
$ scp -r deigo:/bucket/UnitU/my_dir .
You can of course do the same in the other direction as well.
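For example, to send the same directory from your local computer back to Bucket (“UnitU” again being a placeholder):

$ scp -r my_dir deigo:/bucket/UnitU/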
The Deigo login nodes have a directory (storage system) called
Datashare, mounted as /datashare. It’s meant for quick, temporary
sharing of data to workshop participants, or for members of different
units that need to quickly transfer data to each other.
If you copy data to Datashare, it automatically becomes readable to anybody. Any data there will also be deleted after a few days if you forget to delete it yourself.
If you want to transfer a file to somebody else, follow these steps:

1. Copy the file to /datashare.
2. Tell the other person the name of the file.
3. They copy it from /datashare to their own directory.
4. Delete the file from /datashare once they have it.
Note: All data here is publicly readable. Don’t use Datashare to transfer restricted or confidential data!
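As a minimal sketch of the whole exchange on a Deigo login node, with “results.csv” standing in for your real file:

$ cp results.csv /datashare/          # you, the sender
$ cp /datashare/results.csv .         # the recipient, in their own session
$ rm /datashare/results.csv           # you again, once the transfer is done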
rsync is an advanced program for synchronizing directories and copying
large numbers of files, locally or over a network using ssh.
Before the copy, rsync compares the source and destination folders. Only the files that have changed are actually transferred.
If you use rsync to keep two directories in sync between your local and remote machines, only the changes since the last sync will be transferred. This greatly reduces the amount of time you need to synchronize them.
If rsync is interrupted (because you lost the connection, or because you had to leave and turn off the computer, for instance) it will pick up again where it left off instead of starting over. This makes rsync a very dependable way to copy large data volumes over slow or unreliable network connections.
It is a complicated program with many options, so we refer you to the main documentation for all the details. Here are a couple of common examples:
$ rsync -av --no-group --no-perms mydir/ deigo:target-dir/
This sends the content of “mydir” on a local computer into “target-dir”.
Note the “/” at the end of mydir/. For rsync this means that you want to
copy the contents inside, and not create the “mydir” directory itself. The
files inside “mydir” end up inside “target-dir”.
The flag “-a” is short for “archive”. It will copy file
attributes as well, such as modification times, symbolic links,
permissions, ownership and so on.
“-v” means to be verbose and print out what rsync is doing. You may want to
remove this if you’re copying a very large number of small files.
”--no-group” and “--no-perms” makes sure we don’t copy ownership and
permissions, as they wouldn’t match on the remote system.
Unlike with most commands, the final slash “/” actually matters to rsync. To copy the directory itself, not just the contents, remove the final slash on the source:
$ rsync -av --no-group --no-perms mydir deigo:target-dir/
This sends “mydir” itself and its contents into “target-dir”.
Without a “/” at the end, rsync will copy the directory itself, so you
get “target-dir/mydir/”.
Note that the final slash after the target doesn’t matter.
$ rsync -av --no-group --no-perms --partial mybigfiles/ deigo:target-dir/
For very large files (gigabytes or more), resending the entire file if it got
interrupted would waste a lot of time. The “--partial” option tells rsync not
to delete partial files if it gets interrupted, but to pick up where it left
off.
Sometimes you want to skip certain files while copying the rest.
$ rsync -av --no-group --no-perms --exclude='*.bk' deigo:datadir localdir/
The --exclude option lets you set a pattern of file names to skip when
copying. In this case, any file ending in “.bk” (common for backup files) will
be skipped when you copy “datadir” on Deigo to your local folder “localdir/”
on your own computer.
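You can give --exclude several times to skip more than one pattern; for example, assuming you also want to skip temporary files:

$ rsync -av --no-group --no-perms --exclude='*.bk' --exclude='*.tmp' deigo:datadir localdir/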
This can be dangerous, but you can tell rsync to delete any files in the
destination folder that don’t exist in the source folder. Use the --del
option:
$ rsync -av --no-group --no-perms --del deigo:datadir localdir/
This synchronizes everything in “datadir” on Deigo with “localdir” on the
local machine. It will delete any files in “localdir” that are not in
“datadir” (that is, if they were deleted in “datadir” they’ll be
deleted locally as well).
This can be useful if the source folder keeps changing as you work and you want to keep an up-to-date copy of it elsewhere: if you delete a file in “datadir” on Deigo, you want that file gone in your local copy as well.
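Since --del removes files, it can be worth previewing the synchronization first. The “-n” (--dry-run) option makes rsync print what it would transfer or delete without actually doing anything:

$ rsync -avn --no-group --no-perms --del deigo:datadir localdir/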
“sshfs” is a “pseudo-filesystem” that you can use on Linux and macOS. It
connects a remote directory to a local one, much like mounting a remote
filesystem. In the background it uses ssh to actually transfer the data, but
it lets you treat the remote directory as a part of your regular filesystem.
First you need to get sshfs. On Linux it is available from your distribution’s package manager. On macOS you may need to install it through one of the open-source distribution systems. Once you have installed it, the format for starting it is:
$ sshfs [options] deigo:datadir localdir/
This will mount “datadir” in your home on Deigo onto “localdir” on your local
computer. “localdir” needs to be an empty directory. Ideally you would
make a specific subdirectory, one for each remote, in a “mount”
directory:
$ mkdir -p mount/deigo
If there’s no activity, ssh will normally close the connection after some time. That is very inconvenient when you use it as a file system. Also, sshfs will by default not try to reconnect if it loses connection. Finally, your user ID is different on the local and remote machine, and we want to make sure any files are presented with the right ownership.
You will want to use three options for sshfs: “reconnect” to make it
reconnect; “idmap=user” to resolve user identity differences; and
“ServerAliveInterval=30” to keep the connection alive and to detect
if it disconnects, by pinging the server every 30 seconds.
Let’s say I want to access my /bucket/UnitU/mydata/ directory on
Deigo, and mount it locally on mount/deigo. I would do this as:
$ sshfs -o reconnect,idmap=user,ServerAliveInterval=30 deigo:/bucket/UnitU/mydata mount/deigo
That command is a mouthful, so you may want to put this in a small shell script.
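A minimal sketch of such a script, assuming the same remote path and mount point as above (adjust both for your own setup):

#!/bin/sh
# Mount my Deigo data directory over sshfs.
mkdir -p "$HOME/mount/deigo"
sshfs -o reconnect,idmap=user,ServerAliveInterval=30 \
    deigo:/bucket/UnitU/mydata "$HOME/mount/deigo"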
You can unmount it again with “fusermount”:
$ fusermount -u mount/deigo
This gives you the same convenience as using Samba to mount Bucket as a remote folder, but with more flexibility.
sshfs is very convenient: you can mount any directory you can reach with ssh, in a safe, encrypted manner, and treat it as a local directory without using a VPN or any other extras. But it has a few drawbacks.
In particular, sshfs does not deal well with disconnections. Any access to the directory while it’s disconnected will hang, waiting for a remote reply. You can try it for yourself: mount a directory on Deigo as above, disable your network connection, then try listing the directory:
$ ls mount/deigo
After a few seconds, the command will suddenly hang, and can’t be stopped. In fact, any software that directly or indirectly tries to access that directory will now freeze.
Since we added the “ServerAliveInterval” option above, sshfs will
eventually give up trying and let the applications run again, but it
will still take up to a minute or so. For this reason, sshfs is really
better suited for your workstation than for a laptop that often loses
the connection as you move about.
To forcibly stop sshfs, you can force unmount the file system (you may need to do it as root):
$ sudo umount --force mount/deigo
You may need to repeat the command a few times before it really takes effect.
You can also look for the actual ssh process and kill it:
$ ps ax|grep "ssh.*sftp"
26854 pts/23 S 0:00 ssh -x -a [...] -oServerAliveInterval=30 [...] deigo -s sftp
$ kill -9 26854
Just be careful that you don’t kill the wrong ssh process by mistake.
Finally, if sshfs has disconnected, the OS may still mistakenly see the remote
as mounted, so you can’t remount. Then you can do a “lazy” unmount (where it
doesn’t wait for a response from the server) with the “-z” option to make the
OS release the mount point:
$ fusermount -uz mount/deigo
That will let you remount it again immediately if you want.
¹ Your own computer doesn’t have a fixed address or name on the network, so there’s no way for you to connect to it from Deigo.