
Transfer Data

You can transfer data between your computer and our storage systems in several ways. You can mount Bucket as a remote folder; use ssh to copy data from the terminal; use Datashare to transfer data to other users at OIST; or use Rsync for fast, reliable transfer of large data sets.

Access Bucket as a shared remote folder from your desktop
Use ssh and sftp to copy data on the command line
Quickly transfer data to other users with Datashare

Fast, reliable transfer using Rsync
Mount a remote directory using sshfs

Using remote folders

You can add Bucket to your desktop as a remote folder. The details depend on your operating system, but you mount it as a shared drive using the SMB protocol.

File system   Server           Domain   Share name
bucket        bucket.oist.jp   OIST     bucket

Please check the IT help pages for more information on how to mount remote shared folders on your operating system (links go to the external IT site):

Windows
Mac OS
Linux
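As an illustration only, a mount on Linux can be made permanent with an /etc/fstab entry using the server, domain, and share name from the table above. This is a hypothetical sketch: the mount point, credentials file, and uid are assumptions you must adapt, and the IT pages linked above remain the authoritative instructions.

```
# Hypothetical /etc/fstab entry for mounting Bucket over SMB/CIFS.
# Adapt the mount point, credentials file, and uid to your own setup.
//bucket.oist.jp/bucket  /mnt/bucket  cifs  credentials=/home/you/.smbcredentials,domain=OIST,uid=you,_netdev  0  0
```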

Copy Files Using scp

The “scp” command, short for “secure copy”, is the main way to copy files to and from Deigo. You tell it what to copy, then where to copy it to.

For example: If you want to copy a file myfile.txt from your own computer to your home folder on Deigo, you would open a terminal (or a MobaXTerm window) on your own computer and do:

$ scp myfile.txt oist-id@deigo.oist.jp:

The scp command copies files not just between directories, but between computers. To specify a remote machine, you split the source or destination into two parts with a colon (“:”): before the colon comes the name of the remote computer, and after it comes the path on that computer. Here is the full pattern:

$ scp user-id@source-computer:/path/to/file  user-id@destination:/path/to/folder

The first path is the address of the remote computer, with your user ID. Then a colon “:” and then the path on that computer.

If you leave out the local path, scp assumes you mean your home directory. So our scp command above will copy “myfile.txt” to our home directory.

If you leave out the remote computer and the colon, scp assumes you mean a path on the local computer. If you accidentally forget the colon and run:

$ scp myfile.txt oist-id@deigo.oist.jp

scp will copy “myfile.txt” into a local file in the current directory named “oist-id@deigo.oist.jp”, which is probably not what you wanted.

If you set up an ssh configuration file as described in “Connect to the Clusters”, you can use the short names here too:

$ scp myfile.txt deigo:
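The short name works because of an entry in your ~/.ssh/config file. A minimal sketch of what such an entry can look like is below; your actual entry from “Connect to the Clusters” may differ (for example by adding a jump host), and “your-oist-id” is a placeholder.

```
# ~/.ssh/config (minimal sketch; replace your-oist-id with your actual ID)
Host deigo
    HostName deigo.oist.jp
    User your-oist-id
```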

Copy files to Bucket and Other Locations

The default remote target directory is your home. If you add a path after the colon, that’s where the file will go:

$ scp myfile.txt deigo:/bucket/MyUnit/

Everything before the colon in deigo: specifies the remote computer. Everything after the colon is the directory or file on the remote machine. You can use shell wildcards to specify multiple files.

You can do the same for copying data to and from folders in your home, to and from the Flash and Work file systems and so on.

Copy files from Deigo

To copy files from Deigo to your local computer, you need to start on your local machine, not Deigo [1]; you specify the remote source on Deigo and the local destination on your computer:

$ scp deigo:/bucket/UnitU/my_dir/* .

This would copy all the files in directory my_dir in Bucket on Deigo to the current directory on your local computer.

Copy entire Directories

Like most commands, scp by default only works on individual files. But if you add the -r option, you can copy recursively, that is, copy all files in all subdirectories. The command below will copy the my_dir directory and all files and directories inside it.

$ scp -r deigo:/bucket/UnitU/my_dir .

You can of course do the same in the other direction as well.

Datashare

The Deigo login nodes have a directory (storage system) called Datashare, mounted as /datashare. It’s meant for quick, temporary sharing of data to workshop participants, or for members of different units that need to quickly transfer data to each other.

If you copy data to Datashare, it automatically becomes readable to anybody. Any data there will also be deleted automatically after a few days if you forget to delete it yourself.

If you want to transfer a file to somebody else, follow these steps:

  1. From a login node, copy the data to /datashare.
  2. Ask the recipients to log in and copy the data from /datashare to their own bucket or home.
  3. Once they’ve done this, delete the data.
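The steps above can be sketched as shell commands. To keep the sketch self-contained, a temporary directory stands in for /datashare here, and the file names are made up; on Deigo you would use the real /datashare on a login node.

```shell
# Simulate /datashare with a temporary directory (on Deigo you would use
# the real /datashare on a login node instead).
DATASHARE="$(mktemp -d)"

# Step 1: the sender copies the data to Datashare.
echo "example data" > results.txt
cp results.txt "$DATASHARE/"

# Step 2: the recipient copies it to their own storage
# (here: another temporary directory standing in for their bucket or home).
DEST="$(mktemp -d)"
cp "$DATASHARE/results.txt" "$DEST/"

# Step 3: once the recipient has the data, the sender deletes it.
rm "$DATASHARE/results.txt"
```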

Note: All data here is publicly readable. Don’t use Datashare to transfer restricted or confidential data!

Other file access

Rsync

rsync is an advanced program for synchronizing directories and copying large numbers of files, locally or over a network using ssh.

Before the copy, rsync compares the source and destination folders. Only the files that have changed are actually transferred.

It is a complicated program with many options, so we refer you to the main documentation for all the details. Here are a couple of common examples:

$ rsync -av --no-group --no-perms mydir/ deigo:target-dir/

This sends the content of “mydir” on a local computer into “target-dir”. Note the “/” at the end of mydir/. For rsync this means that you want to copy the contents inside, and not create the “mydir” directory itself. The files inside “mydir” end up inside “target-dir”.

The flag “-a” is short for “archive”. It will copy all file attributes as well, such as creation and modification dates, links, permissions, ownership and so on.

“-v” means to be verbose and print out what rsync is doing. You may want to remove this if you’re copying a very large number of small files.

“--no-group” and “--no-perms” make sure we don’t copy ownership and permissions, as they wouldn’t match on the remote system.

Copy a directory, not just its contents

Unlike with most other commands, the final slash “/” actually matters to rsync. To copy the directory itself, not just the contents, remove the final slash on the source:

$ rsync -av --no-group --no-perms mydir deigo:target-dir/

This sends “mydir” itself and its contents into “target-dir”. Without a “/” at the end, rsync will copy the directory itself, so you get “target-dir/mydir/”.

Note that the final slash after the target doesn’t matter.

Copying very large files

$ rsync -av --no-group --no-perms --partial mybigfiles/ deigo:target-dir/

For very large files (gigabytes or more), resending the entire file if it got interrupted would waste a lot of time. The “--partial” option tells rsync not to delete partial files if it gets interrupted, but to pick up where it left off.

Avoid copying certain files

Sometimes you want to skip certain files while copying the rest.

$ rsync -av --no-group --no-perms --exclude='*.bk' deigo:datadir localdir/

The --exclude option lets you set a pattern of file names to skip when copying. In this case, any file ending in “.bk” (common for backup files) will be skipped when you copy “datadir” on Deigo to your local folder “localdir” on your own computer.
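The same exclude pattern can be tried locally. A self-contained sketch with throwaway directories and made-up file names:

```shell
# Demonstrate --exclude with local throwaway directories.
cd "$(mktemp -d)"
mkdir datadir
echo data   > datadir/results.txt
echo backup > datadir/results.bk

# Copy datadir's contents, skipping anything matching "*.bk".
rsync -av --exclude='*.bk' datadir/ localdir/

ls localdir    # results.txt only; results.bk was skipped
```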

Keep two folders synchronized

This can be dangerous, but: you can tell rsync to delete any files in the destination folder that don’t exist in the source folder. Use the --del option:

$ rsync -av --no-group --no-perms --del deigo:datadir localdir/

This synchronizes everything in “datadir” on Deigo with “localdir” on the local machine. It will delete any files in “localdir” that are not in “datadir” (that is, if they were deleted in “datadir” they’ll be deleted locally as well).

This can be useful if the source folder keeps changing as you work, and you want to keep an up to date copy of it elsewhere. So if you deleted a file in “datadir” on Deigo, you want that file gone in your local copy as well.
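The effect of --del can also be seen locally. A self-contained sketch with throwaway directories and made-up file names:

```shell
# Demonstrate --del with local throwaway directories.
cd "$(mktemp -d)"
mkdir datadir
echo one > datadir/a.txt
echo two > datadir/b.txt

# First sync: localdir gets both files.
rsync -a datadir/ localdir/

# Delete a file in the source, then sync again with --del.
rm datadir/b.txt
rsync -a --del datadir/ localdir/

ls localdir    # only a.txt; b.txt was deleted from the copy as well
```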

sshfs

“sshfs” is a “pseudo-filesystem” that you can use on Linux and macOS. It will use SSH to connect a remote and a local directory, much like mounting a remote filesystem. In the background it uses ssh to actually transfer the data, but it lets you treat your remote directory as a part of your regular filesystem.

First you need to install sshfs. On Linux it is available from your distribution’s package manager. On macOS you may need to install it through one of the open source package systems. Once you have installed it, the format for starting it is:

$ sshfs [options] deigo:datadir localdir/

This will mount “datadir” in your home on Deigo onto “localdir” on your local computer. “localdir” needs to be an empty directory. Ideally you would make a specific subdirectory, one for each remote, in a “mount” directory:

$ mkdir -p mount/deigo

If there’s no activity, ssh will normally close the connection after some time. That is very inconvenient when you use it as a file system. Also, sshfs will by default not try to reconnect if it loses connection. Finally, your user ID is different on the local and remote machine, and we want to make sure any files are presented with the right ownership.

You will want to use three options for sshfs: “reconnect” to make it reconnect; “idmap=user” to resolve user identity differences; and “ServerAliveInterval=30” to keep the connection alive and to detect if it disconnects, by pinging the server every 30 seconds.

Let’s say I want to access my /bucket/UnitU/mydata/ directory on Deigo, and mount it locally on mount/deigo. I would do this as:

$ sshfs -o reconnect,idmap=user,ServerAliveInterval=30 deigo:/bucket/UnitU/mydata mount/deigo

That command is a mouthful, so you may want to put this in a small shell script.
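A minimal sketch of such a script is below. The remote path and mount point are assumptions you would adapt to your own unit, and the script skips the mount (with a message) when sshfs is not installed or the connection fails.

```shell
#!/bin/bash
# mount-deigo.sh -- hypothetical wrapper around the sshfs command above.
# REMOTE and MOUNTPOINT are assumptions; adapt them to your own unit.
REMOTE="deigo:/bucket/UnitU/mydata"
MOUNTPOINT="$HOME/mount/deigo"

# Make sure the (empty) mount point exists.
mkdir -p "$MOUNTPOINT"

if command -v sshfs >/dev/null 2>&1; then
    sshfs -o reconnect,idmap=user,ServerAliveInterval=30 \
        "$REMOTE" "$MOUNTPOINT" || echo "mount failed (no connection to deigo?)"
else
    echo "sshfs is not installed; skipping mount"
fi
```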

You can unmount it again with “fusermount”:

$ fusermount -u mount/deigo

This gives you the same convenience as using Samba to mount Bucket as a remote folder, but with more flexibility.

Issues with sshfs

sshfs is very convenient: you can mount any directory you can reach with ssh, in a safe, encrypted manner, and treat it as a local directory without using a VPN or any other extras. But it has a few drawbacks.

The main one is that sshfs does not deal well with disconnections. Any access to the directory while it’s disconnected will hang, waiting for a remote reply. You can try it for yourself: mount a directory on Deigo as above, disable your Wi-Fi, then try listing the directory:

$ ls mount/deigo

After a few seconds, the command will suddenly hang, and can’t be stopped. In fact, any software that directly or indirectly tries to access that directory will now freeze.

Since we added the “ServerAliveInterval” option above, sshfs will eventually give up trying and let the applications run again, but it will still take up to a minute or so. For this reason, sshfs is really better suited for your workstation than for a laptop that often loses the connection as you move about.

To forcibly stop sshfs, you can force unmount the file system (you may need to do it as root):

$ sudo umount --force mount/deigo

You may need to repeat the command a few times before it really takes effect.

You can also look for the actual ssh process and kill it:

$ ps ax|grep "ssh.*sftp"
26854 pts/23 S  0:00 ssh -x -a [...] -oServerAliveInterval=30 [...] deigo -s sftp
$ kill -9 26854

Just be careful that you don’t kill the wrong ssh process by mistake.

Finally, if sshfs has disconnected, the OS may still mistakenly see the remote as mounted, so you can’t remount. Then you can do a “lazy” unmount (where it doesn’t wait for a response from the server) with the “-z” option to make the OS release the mount point:

$ fusermount -uz mount/deigo

That will let you remount it again immediately if you want.

Footnotes

  1. Your own computer doesn’t have a fixed address or name on the network so there’s no way for you to connect to it from Deigo.