There are two text files, each with 10 million lines and about 100 MB in size. We need to know how many lines the two files have in common, in other words, the number of lines that exist in both files. Each line is unique within its file, so neither file contains duplicate rows. A Python set does this very easily and more efficiently than shell tools or awk.
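#!/usr/bin/python
# Load each file into a set of lines; neither file has duplicate lines.
a = set(open("data.uniq.1"))
b = set(open("data.uniq.2"))
# The intersection holds the lines present in both files.
print(len(a & b))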
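If holding both files in memory at once is a concern, one possible variation is to load only the first file into a set and stream the second. This is a minimal sketch, not the script above: the helper name `count_common_lines` is hypothetical, and stripping the trailing newline is an added assumption so that a final line without a newline still matches. The count is correct only because each file has no duplicate lines.

```python
def count_common_lines(path_a, path_b):
    """Count the lines that appear in both files (hypothetical helper)."""
    # Keep only the first file in memory, as a set of lines without newlines.
    with open(path_a) as fa:
        seen = {line.rstrip("\n") for line in fa}
    # Stream the second file line by line and count the lines already seen.
    # Assumes the second file has no duplicate lines, as stated in the post.
    with open(path_b) as fb:
        return sum(1 for line in fb if line.rstrip("\n") in seen)
```

Calling `count_common_lines("data.uniq.1", "data.uniq.2")` should give the same number as `len(a & b)`, while building one set instead of two.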
I also found a blog post in Chinese that describes this tip.