User:Glauxosdever/Version Control Software: Difference between revisions

Jump to navigation Jump to search
Add some example design and implementation (merging still to do)
(Add Version Control Software stub)
 
(Add some example design and implementation (merging still to do))
Line 9:
=== Approach A ===
 
Initially, use [[git]] or any other ready VCS. Then, when your OS is ready to be self-hosting, write a VCS that runs on your OS and switch all development to it. This will have the drawback that your new repository will not hold the past history of the period that you used [[git]] or any other ready VCS.
 
=== Approach B ===
Line 21:
== Writing a Basic VCS ==
 
In this section, we will write a very basic VCS. It will not even remotely support any remote functionality (pun intended), but it will be enough to get you started. We will write it in [[Python]], since Python is a good programming language for prototyping stuff.
 
=== The Internals ===
 
We will create a snapshot-based VCS, i.e. we won't store the diff of the files in each commit. Instead, we will store the full files in order for retrieving source trees to be an efficient operation. You may of course want to change that later.
 
We will store the repository on the filesystem (you could use a database instead if you like it). The root directory of the repository will be <code>.repo</code>, which will contain the subdirectories <code>branches</code>, <code>commits</code> and <code>files</code>. As you probably guessed, <code>branches</code> will contain information about branches. However, <code>commits</code> will only contain information about the contents of commits, i.e. no author, time, and such information -- that's in order to deduplicate space when two commits are similar in contents. <code>files</code> will only contain information about the file contents, i.e. no author, time, and such information -- again, that's in order to deduplicate space when two files are similar in contents.
 
Additionally, the working directory for each branch will be in a directory at the same level as the <code>.repo</code> directory, which will be named after the branch.
 
==== Branches ====
 
The <code>branches</code> directory will contain subdirectories named after the names of the branches. For example, if we create a branch "master", a subdirectory <code>master</code> will be created. Inside, we will have a <code>commits</code> file that will contain, per line per commit, a commit hash and a timestamp.
 
The <code>commits</code> file may look like this:
<pre>
d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41 1547387622
ba04414fa9bf63b8a39f658483edc0f6b8a8c759f707ec6846134986db9dc929 1547388056
</pre>
 
==== Commits ====
 
The <code>commits</code> directory will contain subdirectories named after the hashes of the commits. For example, if we create a commit "d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41", a subdirectory <code>d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41</code> will be created. Inside, we will have a <code>branches</code> file that will list the branches where this commit it exists. Also, we will have a <code>files</code> file that will list, per line per file, its hash and its path.
 
The <code>branches</code> file may look like this:
<pre>
master
feature1
feature2
</pre>
 
The <code>files</code> file may look like this:
<pre>
7d2b62d9a6f9a0ebfdde033e3cfb197f4d46998c22853861eab603cbd1246920 Makefile
487ea42328a8738a12b378dce4329e2432a0938ff12832ec932fa88c948ca015 src/test.c
</pre>
 
A commit hash will be derived using the SHA256 algorithm, where the input will be the concatenation of all paths sorted in alphabetical order and all file contents in the same order as the paths. You may want to change that possibly (but, if you include the timestamp, you won't be able to deduplicate similar commits).
 
* '''''Note''''': Real VCSs (e.g. [[git]]) don't put all hashes in one directory. Instead, they split the hashes into directory levels. That's in order to minimise directory search times.
* '''''Note''''': Git includes the timestamp in the input string for deriving the commit hash.
 
==== Files ====
 
The <code>files</code> directory will contain subdirectories named after the hashes of the file revisions. Inside, we will have a <code>commits</code> file that will list the commits where this file revision exists. Also, the <code>data</code> file will contain the contents of the file as is appears during that revision.
 
The <code>commits</code> file may look like this:
<pre>
d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41
ba04414fa9bf63b8a39f658483edc0f6b8a8c759f707ec6846134986db9dc929
</pre>
 
* '''''Note''''': Real VCSs employ compression for the file contents, in order to decrease disk space usage.
 
=== The Code ===
 
So we get the following <code>.py</code> files...
 
==== utils.py ====
 
This file contains four functions as described in the comments. It's still nothing VCS-specific.
 
<source lang="python">
# This function reads <filename> into a list of lines.
# The trailing '\n' character of each line is discarded.
def readlines(filename):
fp = open(filename, "r")
lines = fp.read().splitlines()
return lines
 
# This function writes a list of lines into <filename>.
# The trailing '\n' character of each line is added.
def writelines(filename, lines):
fp = open(filename, "w")
for line in lines:
fp.write(line + "\n")
fp.close()
 
# This function reads <filename> into a list of lists of words.
def readwordlines(filename):
lines = readlines(filename)
wordlines = [line.split(" ") for line in lines]
return wordlines
 
# This function writes a list of lists of words into <filename>.
def writewordlines(filename, wordlines):
lines = [" ".join(wordline) for wordline in wordlines]
writelines(filename, lines)
</source>
 
==== init.py ====
 
This file contains code to initialise an empty repository. You should make it executable.
 
<source lang="python">
#! /usr/bin/python
 
import os
 
import utils
 
def main():
# Create necessary directories.
os.mkdir(".repo", 0755)
os.mkdir(".repo/branches", 0755)
os.mkdir(".repo/commits", 0755)
os.mkdir(".repo/files", 0755)
 
if __name__ == "__main__":
main()
</source>
 
==== branch.py ====
 
This file contains code to create an empty branch. You should also make it executable.
 
<source lang="python">
#! /usr/bin/python
 
import os
import sys
 
import utils
 
def main():
# Get branch name.
branch = sys.argv[1]
# Here you should probably check if the branch name is allowed.
# For example, you may want to disallow certain names that have
# a special meaning in your internal representation.
 
# Create the .repo/branches/<branch> directory.
os.mkdir(".repo/branches/" + branch, 0755)
 
# Create an empty .repo/branches/<branch>/commits file.
fp = open(".repo/branches/" + branch + "/commits", "w")
fp.close()
 
# Create the <branch> directory, if it doesn't already exist.
try:
os.mkdir(branch, 0755)
except OSError:
# Directory already exists.
pass
 
if __name__ == "__main__":
main()
</source>
 
==== commit.py ====
 
Now stuff is getting complex. I hope I have explained it well in the comments...
 
You should also make this file executable.
 
<source lang="python">
#! /usr/bin/python
 
import hashlib
import os
import sys
import time
 
import utils
 
def main():
# Get branch name.
branch = sys.argv[1]
 
# Get current time.
now = int(time.time())
 
# Get the list of all files in the branch source tree.
filepaths = []
for cur, dirs, files in os.walk(branch):
for f in files:
# Get the paths without the beginning "<branch>/".
# For example, the path "master/src/main.c" becomes "src/main.c".
filepaths.append((cur + "/" + f).split("/", 1)[1])
 
# Sort the paths as it's important for them to be in a specific order
# for the SHA256 hash function. Imagine if you got two different hashes
# for the same directory structure; that would be broken!
filepaths.sort()
 
# Get the list of all file contents.
# Normally, the paths should be traversed sequentially, thus the contents
# should be in the correct order for the SHA256 hash function.
# TODO: I don't know if this is the case, you should verify it
# if you are serious about using this code to get started.
filecontents = []
for filepath in filepaths:
fp = open(branch + "/" + filepath, "r")
filecontents.append(fp.read())
fp.close()
 
# Get file paths string.
filepathsstr = ""
for filepath in filepaths:
filepathsstr += filepath
 
# Get file contents string.
filecontentsstr = ""
for filecontent in filecontents:
filecontentsstr += filecontent
 
# Derive the commit hash.
commithash = hashlib.sha256(filepathsstr + filecontentsstr).hexdigest()
 
# Create the .repo/commits/<commit> directory. If the directory already
# exists, we will just update the list of branches where it is present.
commitexists = False
try:
os.mkdir(".repo/commits/" + commithash, 0755)
except OSError:
commitexists = True
 
if commitexists:
# Update the .repo/commits/<commit>/branches file, in case <branch>
# is not found in it.
commitbranches = utils.readlines(".repo/commits/" + commithash + "/branches")
found = False
for i in range(0, len(commitbranches)):
if commitbranches[i] == branch:
found = True
break
 
if not found:
commitbranches.append(branch)
utils.writelines(".repo/commits/" + commithash + "/branches", commitbranches)
 
else:
# Loop through paths and contents.
filehashes = []
for i in range(0, len(filepaths)):
# Create the file hash. We will use the file contents.
filehash = hashlib.sha256(filecontents[i]).hexdigest()
 
# Append the file hash to the list of file hashes.
filehashes.append(filehash)
 
# Create the .repo/files/<file> directory. If the file already exists,
# we will just update the list of commits where it is present.
fileexists = False
try:
os.mkdir(".repo/files/" + filehash, 0755)
except OSError:
# File already exists.
fileexists = True
 
if fileexists:
# Update the .repo/files/<file>/commits file.
filecommits = utils.readlines(".repo/files/" + filehash + "/commits")
filecommits.append(commithash)
utils.writelines(".repo/files/" + filehash + "/commits", filecommits)
else:
# Create the .repo/files/<file>/data file.
fp = open(".repo/files/" + filehash + "/data", "w")
fp.write(filecontents[i])
fp.close()
 
# Create the .repo/files/<file>/commits file.
filecommits = [commithash]
utils.writelines(".repo/files/" + filehash + "/commits", filecommits)
 
# Create the .repo/commits/<commit>/files file.
commitfiles = []
for i in range(0, len(filepaths)):
commitfiles.append([filehashes[i], filepaths[i]])
utils.writewordlines(".repo/commits/" + commithash + "/files", commitfiles)
 
# Create the .repo/commits/<commit>/branches file.
commitbranches = [branch]
utils.writelines(".repo/commits/" + commithash + "/branches", commitbranches)
 
# Update the .repo/branches/<branch>/commits file.
branchcommits = utils.readwordlines(".repo/branches/" + branch + "/commits")
branchcommits.append([commithash, "{}".format(now)])
utils.writewordlines(".repo/branches/" + branch + "/commits", branchcommits)
 
if __name__ == "__main__":
main()
</source>
 
==== fsck.py ====
 
This one is very important. It does an integrity check of the repository (also useful when testing if the VCS that will run on your OS is compatible with this one).
 
It is also quite complex, I hope it is correct and that I explained stuff well in the comments. Again, make it executable.
 
<source lang="python">
#! /usr/bin/python
 
import hashlib
import os
 
import utils
 
def fsckfile(f):
# Get file data.
fp = open(".repo/files/" + f + "/data", "r")
data = fp.read()
fp.close()
 
# Check if the file hash is correct.
filehash = hashlib.sha256(data).hexdigest()
if filehash != f:
print("Integrity check failed: File " + f + " should have hash " + filehash)
 
def checkcommitinfile(commit, f):
# Get the list of commits that <f> lists. <commit> should be listed.
commits = utils.readlines(".repo/files/" + f + "/commits")
found = False
for c in commits:
if c == commit:
found = True
break
 
if not found:
print("Integrity check failed: File " + f + " is listed in commit " + commit + ", but file " + f + " doesn't list commit " + commit)
 
def fsckcommit(commit):
# Get the list of files that are listed in <commit>.
files = utils.readwordlines(".repo/commits/" + commit + "/files")
 
# Get the list of file hashes and paths based on files above.
filehashes = []
filepaths = []
for f in files:
filehashes.append(f[0])
filepaths.append(f[1])
 
# Get file paths string.
filepathsstr = ""
for filepath in filepaths:
filepathsstr += filepath
 
# Get file contents string. Also detect if some files are missing.
filecontentsstr = ""
missing = False
for filehash in filehashes:
try:
fp = open(".repo/files/" + filehash + "/data")
filecontentsstr += fp.read()
fp.close()
except IOError:
print("Integrity check failed: Commit " + commit + " lists file " + filehash + ", but file " + filehash + " doesn't exist")
missing = True
 
if not missing:
# Check if the commit hash is correct.
commithash = hashlib.sha256(filepathsstr + filecontentsstr).hexdigest()
if commithash != commit:
print("Integrity check failed: Commit " + commit + " should have hash " + commithash)
 
# Check if the relation of this commit and its files is bidirectional.
for filehash in filehashes:
checkcommitinfile(commit, filehash)
 
def checkbranchincommit(branch, commit):
# Get the list of branches that <commit> lists. <branch> should be listed.
branches = utils.readlines(".repo/commits/" + commit + "/branches")
found = False
for b in branches:
if b == branch:
found = True
break
 
if not found:
print("Integrity check failed: Commit " + commit + " is listed in branch " + branch + ", but commit " + commit + " doesn't list branch " + branch)
 
def fsckbranch(branch):
# Get the list of commits that are listed in <branch>.
commits = utils.readwordlines(".repo/branches/" + branch + "/commits")
 
# Get the list of commit hashes based on commits above.
commithashes = []
for commit in commits:
commithashes.append(commit[0])
# Check if the relation of this branch and its commits is bidirectional.
# Also detect if any commits are missing.
for commithash in commithashes:
if os.path.exists(".repo/commits/" + commithash):
checkbranchincommit(branch, commithash)
else:
print("Integrity check failed: Branch " + branch + " lists commit " + commithash + ", but commit " + commithash + " doesn't exist")
 
def main():
# Check integrity for each file.
for cur, dirs, files in os.walk(".repo/files"):
for d in dirs:
fsckfile(d)
 
# Check integrity for each commit.
for cur, dirs, files in os.walk(".repo/commits"):
for d in dirs:
fsckcommit(d)
 
# Check integrity for each branch.
for cur, dirs, files in os.walk(".repo/branches"):
for d in dirs:
fsckbranch(d)
 
if __name__ == "__main__":
main()
</source>
 
==== merge.py ====
 
'''''TODO'''''
241

edits

Cookies help us deliver our services. By using our services, you agree to our use of cookies.

Navigation menu