
I can select a random file using this command

find ./ -type f | shuf -n 1

But it sometimes shows the same file.
Is it possible to avoid picking duplicate files?
Is there another utility for this task?

I have around 50k txt files in a folder (which may contain nested subfolders), and I want to pick a random file to view. I don't want to see the same file again, and new files are added to the folder every day...

muru
Akhil
    "Without repetition" means remembering previous choices (assuming you're not deleting files after selecting them). Why not return a randomly-ordered list and consume it in order? – Jeff Schaller Nov 22 '19 at 19:28
  • Tangentially related (but not for the `bash` shell): [Returning randomised items from glob match](//unix.stackexchange.com/q/551160) – Kusalananda Nov 22 '19 at 19:35
  • Related: https://unix.stackexchange.com/q/543186/117549 – Jeff Schaller Nov 22 '19 at 21:03
  • Why not return a randomly-ordered list and consume it in order?... files will be updated often... so this is not a good idea...and "Without repetition" means remembering previous choices - yes .... @JeffSchaller – Akhil Nov 24 '19 at 05:35

4 Answers


The issue with your code is that you re-generate the list each time you pick a new pathname. This can give you the same pathname over and over again, for as long as the same files remain in the directories you generate the list over.

The simple answer, for the case where you run your script only occasionally, is to move the processed files away (or delete them). That way, the next time you run the script and re-generate the random list, the already-processed files will not be part of it.

For example, assuming all files are located in or below the directory $HOME/newfiles, the following would pick a file and then move it to $HOME/oldfiles:

myfile=$( find "$HOME/newfiles" -type f -print0 | shuf -z -n 1 )

# use "$myfile" here

# later... move "$myfile" to somewhere else:
mv "$myfile" "$HOME/oldfiles"

The rest of this answer is concerning the case when you want to loop over randomised pathnames in one and the same invocation of the script.


Assuming your files and directories do not contain embedded newlines, this shows what Jeff Schaller suggested in a comment:

find ./ -type f | shuf |
while IFS= read -r pathname; do
    # do work with "$pathname"
done

This would give you random pathnames of regular files in or below the current directory, provided that, as mentioned, none of the pathnames in the hierarchy contain newlines (if they did, shuf would scramble those names).

A safe variant is to shuffle a nul-terminated list instead:

readarray -t -d '' pathnames < <( find . -type f -print0 | shuf -z )
for pathname in "${pathnames[@]}"; do
    # use "$pathname" here
done

This example (and the next) is adapted from https://unix.stackexchange.com/a/543188/116858


In the zsh shell, you could possibly do

for pathname in ./**/*(.DNnoe['REPLY=$RANDOM'])
do
   # use $pathname here
done

This works similarly to the code above with the difference that since this is using a shell glob and no line-oriented text-filtering tools, newlines in filenames would not be an issue (and you don't have to pass around nul-terminated lists).

The neat thing about doing this in zsh is that you don't need to call any external tools.

Kusalananda
  • when I re-run the script after some time .. it may show duplicate files again.... @kusalananda – Akhil Nov 23 '19 at 05:05
  • @AkhilJ See updated answer. If you have further information about how you want to run your script, or anything else that may be helpful, then please update your question rather than adding it in comments. – Kusalananda Nov 23 '19 at 07:37

If I am understanding the question properly, one thing the OP can do is shuffle the list into a file (or a variable, if in a `bash` script), then pull elements from that list one at a time. That way, the same file will not be picked twice until the full list is exhausted.

For example,

find ./ -type f | shuf > shuffled.txt

to create the list in a file, then call it via something along the lines of,

cat shuffled.txt | head -1 | tail -1
cat shuffled.txt | head -2 | tail -1
cat shuffled.txt | head -3 | tail -1
...

Or an equivalent line with sed or awk.
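For instance, with `n` holding the line number you want (the variable name is just for illustration), the `sed` and `awk` equivalents are:

```shell
n=3
sed -n "${n}p" shuffled.txt           # print line n and nothing else
awk -v n="$n" 'NR == n' shuffled.txt  # same, with awk
```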

Alternatively, if this is all being placed into a BASH script, it's possible to do something like this as well:

for filename in $(find ./ -type f | shuf)
do
    echo "$filename"
    # do something with "$filename"
done

(Note that this loop breaks on filenames containing whitespace.)
Jason K Lai

How about just working with the inode....

[[ ! -f seen ]] && touch seen && ls -i seen > seen                       
file=$(find . -type f -printf %i"\n" | sort | join -j 1 -v 1 - seen | shuf -n 1)
echo $file >> seen
sort -o seen seen
find -inum $file -exec cat {} \; #or whatever you want to do with the file

It doesn't matter if the seen file is in your search path; if it is, just add its own inode to it so that it is screened out.
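The screening works because `join -v 1` prints only those lines of the first (sorted) input that have no match in the second, i.e. the inodes not yet listed in seen. A quick illustration with made-up inode numbers:

```shell
printf '101\n205\n333\n' > all   # sorted inode list from find
printf '205\n' > seen            # inode already inspected
join -j 1 -v 1 all seen          # prints 101 and 333
```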

For a single session of inspection just loop over the list

[[ ! -f seen ]] && touch seen && ls -i seen > seen
sort -o seen seen
list=$(mktemp)                        
find . -type f -printf %i"\n" | sort | join -j 1 -v 1 - seen | shuf -o $list
while read file; do
    echo $file >> seen
    find -inum $file -exec sh -c 'echo -e "$1 contains ....\n"; cat "$1"; echo -e "\n\n"' sh {} \;
    sleep 1
done < $list

Note: the assumption is that files are not deleted. If they are, and inodes are reused, then the stale inodes will have to be deleted from seen.

Deletion makes this approach more complicated: sed copies and rewrites files, which changes the inode of the seen file itself, so a solution to the deletion issue is to use ed rather than sed.

To delete the file `touch wood`:

d="touch wood"; find . -iname "$d" -printf %i"\n%p\n" | while read i ; do read f; rm "$f" ;printf "%s\n" "/$i/d" wq | ed -s seen; done;
bu5hman
  • I found this answer complex compared to others...also this script seems using more Unix tools... @bu5hman – Akhil Nov 24 '19 at 12:12
  • It is more complex as it tries to keep count of all specific files inspected in a structured way. I must admit to having been stumped at first by the fact that `sed` copies the file and this destroys the `inode` references which is a pain but if you are not deleting files then this is not an issue. A similar approach could be used on raw file paths without the complication. If the general approach of keeping a log file works for you then it's easy enough to revamp for logging file names not `inodes`. Would a more detailed explanation help? – bu5hman Nov 24 '19 at 17:37
  • You can't use the inode to check if a file has been changed. It may or may not change when a file is renamed, or moved. @bu5hman – Akhil Nov 25 '19 at 03:43
  • Agreed, that's why deletion from the tree is a problem, but it isn't if the files are just accumulating. Was an interesting thing to try though. – bu5hman Nov 25 '19 at 05:10
  1. Using `find`:
touch ~/shuffled.txt   # make sure the tracking file exists
find ./ -type f | shuf |
while IFS= read -r pathname; do
    if ! grep -qxF "$pathname" ~/shuffled.txt; then
      # do work with "$pathname"
      echo "$pathname" >> ~/shuffled.txt
    fi
done

This keeps track of the files that have already been shown.
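The check relies on `grep -xF`, which matches the whole line literally, so one pathname being a prefix of another cannot cause a false positive:

```shell
printf './dir/file.txt\n' > shuffled.txt
grep -qxF './dir/file' shuffled.txt || echo 'not seen'   # prefix alone does not match
grep -qxF './dir/file.txt' shuffled.txt && echo 'seen'   # exact line matches
```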

  2. Using `mlocate`:

Running find every time takes more time... instead, it's better to use the mlocate utility here:

#!/bin/bash
set -e
touch ~/shuffled.txt   # make sure the tracking file exists
sudo updatedb -U ./ -o mlocate.db && locate -d mlocate.db '*' | shuf |
while IFS= read -r pathname; do
  if [ -f "$pathname" ]; then
    if ! grep -qxF "$pathname" ~/shuffled.txt; then
      # do work with "$pathname"
      echo "$pathname" >> ~/shuffled.txt
    fi
  fi
done

This way, updatedb only re-reads directories that have changed instead of rescanning every file.

Akhil