Git基本原理解析

常用命令

git cat-file 可以查看obj文件信息，obj文件分为blob、tree、commit三种。
git ls-files -s 可以查看index文件内容。
git verify-pack 可以查看pack压缩文件内容。

git add file 过程

object目录下生成blob文件

blob文件名为哈希过的file内容，内容为压缩过后的file内容。

同时，blob文件作为文件内容的唯一快照，哪怕是10M的文件，只加了一个点，由于sha1的哈希值不同，都会创建出新的快照，此时两个快照占用20M左右的空间。但同时，git会定期执行git gc压缩存储冗余快照到.git/objects/pack目录。

举例如下。
```
git init
echo "123" > test
git add .
find .git/objects/
----------------------------------------------------------
.git/objects/
.git/objects/pack
.git/objects/info
.git/objects/19
.git/objects/19/0a18037c64c43e6b11489df4bf0b9eb6d2c9bf
```
此时test文件内容的快照路径为git/objects/19/0a18037c64c43e6b11489df4bf0b9eb6d2c9bf，文件名为190a18，使用git cat-file可以输出文件内容。
```
git cat-file -t 190a18 # 输出obj类型
----------------------------------------------------------
blob
  
git cat-file -p 190a18 # 输出obj内容
----------------------------------------------------------
123
```

index文件新增关联信息

file文件名 -> blob文件名

git ls-files -s # 输出index文件内容
----------------------------------------------------------
100644 190a18037c64c43e6b11489df4bf0b9eb6d2c9bf 0       test

git status 过程

遍历工作区文件，如果index中没找到某文件名，说明这个文件没被add，是untracked状态。
遍历工作区文件，如果index中能找到某文件名，则计算该文件目前的hash，如果和index中的hash不同，说明这个文件被改过了，是modified状态。
rename操作单独讨论。

git commit 过程

生成tree对象

tree对象代表当前文件夹结构，由blob和tree组成。

例如，当前文件夹结构为：

├── children
│   └── test00
├── test0
└── test1

此时，工作区一共有有3个文件，等同3个blob对象，commit提交一次，生成2个tree对象和1commit对象。此时.git/objects目录应有6个对象。

tree .git/objects/
----------------------------------------------------------
.git/objects/
├── 0c
│   └── fbf08886fca9a91cb753ec8734c84fcbe52c9f
├── 57
│   └── 3541ac9702dd3969c9bc859d2b91ec1f7e6e56
├── 5d
│   └── f6552e4457cb115b7be32720acac0e3fda3cc4
├── 85
│   └── 3f703e7399c262269598d8c89f4d4244ae839c
├── d0
│   └── 0491fd7e5bb6fa28c517a0bb32b8b506539d4d
├── f6
│   └── 9017e8887ba3d0487a3ab96d582cb700f45348
├── info
└── pack

其中，5df655为tree对象，打印该对象：

git cat-file -p 5df655
----------------------------------------------------------
040000 tree 853f703e7399c262269598d8c89f4d4244ae839c    children
100644 blob 573541ac9702dd3969c9bc859d2b91ec1f7e6e56    test0
100644 blob d00491fd7e5bb6fa28c517a0bb32b8b506539d4d    test1

其中，853f70为子目录的tree对象，打印该对象：

git cat-file -p 853f70
----------------------------------------------------------
100644 blob 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f    test2

此时可以完整的看出当前目录结构在git中的存储方式。

生成commit对象

commit对象包含tree信息和parent commit信息，tree作为文件夹快照，parent commit信息将commit串联在一起。

其中，f69017为commit对象，打印commit对象：

git cat-file -p f690
----------------------------------------------------------
tree 5df6552e4457cb115b7be32720acac0e3fda3cc4
author root <root@HIH-L-11940.cn.net.ntes> 1690858826 +0800
committer root <root@HIH-L-11940.cn.net.ntes> 1690858826 +0800
  
init

由于初次提交，没有parent信息，再次提交，打印最新的commit对象:

git cat-file -p 50fc
----------------------------------------------------------
tree 49cf9866d1fed861f3646e74226965eec3cc9fd2
parent f69017e8887ba3d0487a3ab96d582cb700f45348
author root <root@HIH-L-11940.cn.net.ntes> 1690859401 +0800
committer root <root@HIH-L-11940.cn.net.ntes> 1690859401 +0800
  
1st

此时可以完整的看出commit的链式存储结构。

HEAD 实现

不断追踪即可发现HEAD本质为指向某一个commit对象的指针。

cat HEAD

ref: refs/heads/master

cat ./refs/heads/master

c625e4a113dd7872e2384c4c14b065d26c7df654

git cat-file -p c625

tree 6b6558b66ef5d993d968db7cd21bfc43770f6062
parent 113b8bfbdf98012d9d13b0ff95b0f7331ed98a17
author root <root@HIH-L-11940.cn.net.ntes> 1690338266 +0800
committer root <root@HIH-L-11940.cn.net.ntes> 1690338266 +0800

2st cm

sha1安全性

git的blob对象不止是根据文件内容计算哈希值，为了尽量避免碰撞，具体压缩方式是计算文件名+文件大小+文件内容的sha1哈希值，所以还相对安全。

但仍存在极少数碰撞案例，目前社区准备升级到sha256。

git 垃圾回收

情况1：多次add，一次commit，部分add带来的blob是不需要的

echo "123" > test
git add .
echo "123456" > test
git add .	
git commit -m"1st commit"

git fsck # 可以用来检索所有dangling对象，dangling对象是不被任何引用指向的对象

Checking object directories: 100% (256/256), done.
dangling blob 083ffd782654010fa81deb72ca8bd53e566331fe

解决方案：

git prune # 删除所有dangling对象

情况2：新建分支，进行add、commit，后又删除，为了保证恢复，git不认为这是垃圾对象

解决方案：https://stackoverflow.com/questions/3797907/how-to-remove-unused-objects-from-a-git-repository

情况3：反复commit一个做了微小修改的大文件

vi test # 写一个10M大文件
git add .
git cm -m"1st"
vi test # 微小修改
git add .
git cm -m"2st"
vi test # 微小修改
git add .
git cm -m"3st"

此时object目录下有30M左右的文件

解决方案：

主动使用git gc把重复内容压缩到pack目录中（或等待git自行gc）

git gc

此时用verify-pack命令可以查看pack.idx文件具体信息

git verify-pack -v .git/objects/pack/pack-606c016dbf7fcd4c040eb32057308e2ff3b8af8e.idx

git rename

git rename 大致算法

以下为git diff关于rename的几个重要参数。

可以发现，判断rename的流程为：两层for循环遍历两个commit之间所有的-和+文件，只要-和+对应的一组文件相似度到达一定的阈值，则可以判定为rename。

-M（判断两个文件相似度是否达到了一定阈值，可以判定为重命名）

Detect renames. If n is specified, it is a threshold on the similarity index (i.e. amount of addition/deletions compared to the file’s size). For example, -M90% means Git should consider a delete/add pair to be a rename if more than 90% of the file hasn’t changed. Without a % sign, the number is to be read as a fraction, with a decimal point before it. I.e., -M5 becomes 0.5, and is thus the same as -M50%. Similarly, -M05 is the same as -M5%. To limit detection to exact renames, use -M100%. The default similarity index is 50%.

-C（判断两个文件相似度是否达到了一定阈值，可以判定为拷贝）

Detect copies as well as renames. See also --find-copies-harder. If n is specified, it has the same meaning as for -M<n>.

-l（防止源与目标暴力匹配量过大，设置的上限）

The -M and -C options involve some preliminary steps that can detect subsets of renames/copies cheaply, followed by an exhaustive fallback portion that compares all remaining unpaired destinations to all relevant sources. (For renames, only remaining unpaired sources are relevant; for copies, all original sources are relevant.) For N sources and destinations, this exhaustive check is O(N^2). This option prevents the exhaustive portion of rename/copy detection from running if the number of source/destination files involved exceeds the specified number. Defaults to diff.renameLimit. Note that a value of 0 is treated as unlimited.

以nasl为例。

three-way-merge的base下，views为{v0,v1,v2,v3}，其中一侧的分支下，views为{v0,v1,v4,v5}，此时diff操作结果为{v0, v1, -v2, -v3, +v4, +v5}。

rename检测时，v0和v1不用考虑，此时v2和v3都有可能是改名成v4或v5，所以需要依次计算v2-v4、v2-v5、v3-v4、v3-v5相似度，相似度到达一定阈值即会认为是rename。

所以rename的核心难点在于计算两个文件的相似度。

git rename 文件相似度计算

计算文件S和D之间的相似度流程如下：

将文件以chunk为单位切分，chunk标准为：一行字符长度在64B内，则算为一个chunk，若超过64B，则按64B为单位拆分
每个文件生成一个hashtable，k为chunk的hashcode，v为出现次数
对于每个chunk来说，在S中出现的次数为ns，在D中出现的次数为nd

3.1 如果nd ≤ ns，则说明该chunk从S从copy了nd次到D中

3.2 如果nd > ns，则说明该chunk在D中新增了nd - ns次，ns次copy了S
hashtable中同时保留每个chunk的字节个数，所有chunk的copy次数乘以其字节个数的总和，为src_copied。

最终相似分值计算方法为:

score = (int)(src_copied * MAX_SCORE / max_size);

其中max_size为S文件和D文件中较大的一方的字节数。

简单举例：

cat test # 第一次commit
----------------------------------------------------------
1
2
3
   
cat test1 # 第二次commit, 改名为test1, 同时新增一行4
----------------------------------------------------------
1
2
3
4
   
git diff HEAD HEAD~1 # 相似度为75%, 默认-M阈值为50%，所以能识别出rename
----------------------------------------------------------
diff --git a/test1 b/test
similarity index 75%
rename from test1
rename to test
index 94ebaf9..01e79c3 100644
--- a/test1
+++ b/test
@@ -1,4 +1,3 @@
 1
 2
 3
-4

test1为S，test为D

S的hashtable为{1 -> 1, 2 -> 1, 3 -> 1, 4 -> 1}

D的hashtable为{1 -> 1, 2 -> 1, 3 -> 1}

src_copied = 1 * 1 + 1 * 1 + 1 * 1 = 3

max_score = 100

max_size = 4

score = (3 * 100) / 4 = 75%

同时，文件如果明显差异过大，则直接认为不是rename

max_size = ((src->size > dst->size) ? src->size : dst->size); # 较大文件大小
base_size = ((src->size < dst->size) ? src->size : dst->size); # 较小文件大小
delta_size = max_size - base_size; # 两文件差值
if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
        return 0;

该方法优势在于，不仅将行数据作为diff粒度，同时考虑到一行内数据比较多的情况，将大行拆分，且拆分粒度可控（chunk大小）。

例如源文件第一行的第二块数据和目标文件第十行的第六块数据完全一致，如果直接按行为粒度计算相似度，则可能会错过这部分相似数据。

chunkSize默认为64B，理论上说，不考虑极端情况，chunkSize越大性能越好，相似性准确度越低，chunkSize越小性能越差，相似度准确性越高。

git mv 原理

摘自官方FAQ：

Git has a rename command git mv, but that is just for convenience. The effect is indistinguishable from removing the file and adding another with different name and the same content.