Thrift框架详解(四)

协议和编解码(2)-TCompactProtocol

Posted by Jason Lee on 2020-03-21

概诉

上一届主要讨论了Thrift编解码中的二进制协议格式。这是一种比较大众的协议。本节来讨论二进制协议的另一种加强版本。TCompactProtocol

编解码

对于二进制编码,经常需要对数据进行压缩以节省空间,varint可以压缩较小的正数,但是对于负数varint反而更浪费空间,zigzag编码可以处理负数,使负数也可以使用varint编码压缩,protobuf和thrift都使用二者结合的方式来压缩数字类型。

原码、反码、补码
首先,计算机为了方便位运算,使用补码来存储数字。

然后回顾下:

  • 原码:最高位为符号位,剩余位表示绝对值;
  • 反码:除符号位外,对原码剩余位依次取反;
  • 补码:对于正数,补码为其自身;对于负数,除符号位外对原码剩余位依次取反然后+1。

varint

本文以int类型为例,详细介绍如果通过varint+zigzag编码技术压缩数字。
首先,我们常用的数字其实多数都不是很大,比如:商品的价值、动态计数等,对于这些不是很大的数字,二进制的高位多数是0。
varint编码每个字节前1位表示下一个字节是否也是该数字的一部分,后7位表示实际的值,最后,先低位后高位,对于int类型来说,varint编码最少占用1个字节,最多占用5个字节。

varint编码java代码表示:

1
2
3
4
5
6
byte[] bytes = new byte[5];
int idx = 0;
for(idx = 0; (n & -128) != 0; n >>>= 7) {
bytes[idx++] = (byte)(n & 127 | 128);
}
bytes[idx++] = (byte)n;

例如,对于int类型1来说,二进制表示为 00000000 00000000 00000000 00000001,总共占用4个字节,然而,前3个字节都是0,varint就是通过压缩高位的0来达到节省空间的目的,使用varint压缩后,二进制表示为 00000001,只占用1个字节。

对于int类型2^30来说二进制表示为
01000000 00000000 00000000 00000000,使用varint压缩后,二进制表示为10000000 10000000 10000000 10000000 00000100,占用5个字节。

zigzag

对于负数来说,因为最高位符号位始终为1,使用 varint 编码就很浪费空间,zigzag 编码就是解决负数的问题的,同时其对正数也没有很大的影响。

int类型zigzag变换的代码表示为(n << 1) ^ (n >> 31)

  • 左移1位可以消去符号位,低位补0
  • 有符号右移31位将符号位移动到最低位,负数高位补1,正数高位补0

例如:对于正数来说,最低位符号位为0,其他位不变
对于负数,最低位符号位为1,其他位按位取反
-1的二进制表示为 11111111 11111111 11111111 11111111,zigzag变换后 00000000 00000000 00000000 00000001,再用varint编码,是不是很小了。

1的二进制表示为 00000000 00000000 00000000 00000001,zigzag变换后00000001,再用varint编码,依然很小。

解码

zigzag操作为(n >>> 1) ^ -(n & 1)

正数:当前的数乘以2, zigzagY = x * 2
负数:当前的数乘以-2后减1, zigzagY = x * -2 - 1

  • 无符号右移1位
  • 按位与1,然后取负值,这一步非常巧妙,对于正数就是0,负数就是-1
  • 按位异或得到结果
    • 正数是与0按位异或
    • 负数是与-1按位异或

-1 的解码过程为

  • 00000000 00000000 00000000 00000010 = 无符号右移1位((n >>> 1)) = 00000000 00000000 00000000 00000001

  • 00000000 00000000 00000000 00000010 = 按位与1(都为1则为1,否则为0)((n & 1))= 00000000 00000000 00000000 00000000 为负数

  • 上面两个记过异或 00000000 00000000 00000000 00000001 (异或 -1 相同为0,不同为1) = 11111111 11111111 11111111 11111111 = 00000000 00000000 00000000 00000001

TCompactProtocol 代码详解

TCompactProtocol对于二进制编码,经常需要对数据进行压缩以节省空间,varint可以压缩较小的正数,但是对于负数varint反而更浪费空间,zigzag编码可以处理负数,使负数也可以使用varint编码压缩,protobuf和thrift都使用二者结合的方式来压缩数字类型。

来看一下二级制的图示

代码

下面代码省略了一些部分信息,保留了关键注释和方法。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
public class TCompactProtocol extends TProtocol {
private final static TStruct ANONYMOUS_STRUCT = new TStruct("");
private final static TField TSTOP = new TField("", TType.STOP, (short)0);
private final static byte[] ttypeToCompactType = new byte[16];
static {
ttypeToCompactType[TType.STOP] = TType.STOP;
ttypeToCompactType[TType.BOOL] = Types.BOOLEAN_TRUE;
ttypeToCompactType[TType.BYTE] = Types.BYTE;
ttypeToCompactType[TType.I16] = Types.I16;
ttypeToCompactType[TType.I32] = Types.I32;
ttypeToCompactType[TType.I64] = Types.I64;
ttypeToCompactType[TType.DOUBLE] = Types.DOUBLE;
ttypeToCompactType[TType.STRING] = Types.BINARY;
ttypeToCompactType[TType.LIST] = Types.LIST;
ttypeToCompactType[TType.SET] = Types.SET;
ttypeToCompactType[TType.MAP] = Types.MAP;
ttypeToCompactType[TType.STRUCT] = Types.STRUCT;
}

private static final byte PROTOCOL_ID = (byte)0x82;
private static final byte VERSION = 1;
private static final byte VERSION_MASK = 0x1f; // 0001 1111
private static final byte TYPE_MASK = (byte)0xE0; // 1110 0000
private static final int TYPE_SHIFT_AMOUNT = 5;

private static class Types {
public static final byte BOOLEAN_TRUE = 0x01;
public static final byte BOOLEAN_FALSE = 0x02;
public static final byte BYTE = 0x03;
public static final byte I16 = 0x04;
public static final byte I32 = 0x05;
public static final byte I64 = 0x06;
public static final byte DOUBLE = 0x07;
public static final byte BINARY = 0x08;
public static final byte LIST = 0x09;
public static final byte SET = 0x0A;
public static final byte MAP = 0x0B;
public static final byte STRUCT = 0x0C;
}

/**
* Thrift 自己实现的short类型栈
*/
private ShortStack lastField_ = new ShortStack(15);

private short lastFieldId_ = 0;

/**
* If we encounter a boolean field begin, save the TField here so it can
* have the value incorporated.
*/
private TField booleanField_ = null;

/**
* If we read a field header, and it's a boolean field, save the boolean
* value here so that readBool can use it.
*/
private Boolean boolValue_ = null;

@Override
public void reset() {
lastField_.clear();
lastFieldId_ = 0;
}


/**
* Write a message header to the wire. Compact Protocol messages contain the
* protocol version so we can migrate forwards in the future if need be.
*/
public void writeMessageBegin(TMessage message) throws TException {
writeByteDirect(PROTOCOL_ID);
writeByteDirect((VERSION & VERSION_MASK) | ((message.type << TYPE_SHIFT_AMOUNT) & TYPE_MASK));
writeVarint32(message.seqid);
writeString(message.name);
}

/**
* Write a struct begin. This doesn't actually put anything on the wire. We
* use it as an opportunity to put special placeholder markers on the field
* stack so we can get the field id deltas correct.
*/
public void writeStructBegin(TStruct struct) throws TException {
lastField_.push(lastFieldId_);
lastFieldId_ = 0;
}

/**
* Write a struct end. This doesn't actually put anything on the wire. We use
* this as an opportunity to pop the last field from the current struct off
* of the field stack.
*/
public void writeStructEnd() throws TException {
lastFieldId_ = lastField_.pop();
}

/**
* Write a field header containing the field id and field type. If the
* difference between the current field id and the last one is small (< 15),
* then the field id will be encoded in the 4 MSB as a delta. Otherwise, the
* field id will follow the type header as a zigzag varint.
*/
public void writeFieldBegin(TField field) throws TException {
if (field.type == TType.BOOL) {
// we want to possibly include the value, so we'll wait.
booleanField_ = field;
} else {
writeFieldBeginInternal(field, (byte)-1);
}
}

/**
* The workhorse of writeFieldBegin. It has the option of doing a
* 'type override' of the type header. This is used specifically in the
* boolean field case.
*/
private void writeFieldBeginInternal(TField field, byte typeOverride) throws TException {
// short lastField = lastField_.pop();

// if there's a type override, use that.
byte typeToWrite = typeOverride == -1 ? getCompactType(field.type) : typeOverride;

// check if we can use delta encoding for the field id
if (field.id > lastFieldId_ && field.id - lastFieldId_ <= 15) {
// write them together
writeByteDirect((field.id - lastFieldId_) << 4 | typeToWrite);
} else {
// write them separate
writeByteDirect(typeToWrite);
writeI16(field.id);
}

lastFieldId_ = field.id;
// lastField_.push(field.id);
}

/**
* Write the STOP symbol so we know there are no more fields in this struct.
*/
public void writeFieldStop() throws TException {
writeByteDirect(TType.STOP);
}

/**
* Write a map header. If the map is empty, omit the key and value type
* headers, as we don't need any additional information to skip it.
*/
public void writeMapBegin(TMap map) throws TException {
if (map.size == 0) {
writeByteDirect(0);
} else {
writeVarint32(map.size);
writeByteDirect(getCompactType(map.keyType) << 4 | getCompactType(map.valueType));
}
}

/**
* Write a list header.
*/
public void writeListBegin(TList list) throws TException {
writeCollectionBegin(list.elemType, list.size);
}

/**
* Write a set header.
*/
public void writeSetBegin(TSet set) throws TException {
writeCollectionBegin(set.elemType, set.size);
}

/**
* Write a boolean value. Potentially, this could be a boolean field, in
* which case the field header info isn't written yet. If so, decide what the
* right type header is for the value and then write the field header.
* Otherwise, write a single byte.
*/
public void writeBool(boolean b) throws TException {
if (booleanField_ != null) {
// we haven't written the field header yet
writeFieldBeginInternal(booleanField_, b ? Types.BOOLEAN_TRUE : Types.BOOLEAN_FALSE);
booleanField_ = null;
} else {
// we're not part of a field, so just write the value.
writeByteDirect(b ? Types.BOOLEAN_TRUE : Types.BOOLEAN_FALSE);
}
}
public void writeByte(byte b) throws TException {
writeByteDirect(b);
}
//ZigZag编码:16位
public void writeI16(short i16) throws TException {
writeVarint32(intToZigZag(i16));
}
//ZigZag编码:32位
public void writeI32(int i32) throws TException {
writeVarint32(intToZigZag(i32));
}
//ZigZag编码:64位
public void writeI64(long i64) throws TException {
writeVarint64(longToZigzag(i64));
}

public void writeDouble(double dub) throws TException {
byte[] data = new byte[]{0, 0, 0, 0, 0, 0, 0, 0};
fixedLongToBytes(Double.doubleToLongBits(dub), data, 0);
trans_.write(data);
}

public void writeString(String str) throws TException {
try {
byte[] bytes = str.getBytes("UTF-8");
writeBinary(bytes, 0, bytes.length);
} catch (UnsupportedEncodingException e) {
throw new TException("UTF-8 not supported!");
}
}

/**
* Write a byte array, using a varint for the size.
*/
public void writeBinary(ByteBuffer bin) throws TException {
int length = bin.limit() - bin.position();
writeBinary(bin.array(), bin.position() + bin.arrayOffset(), length);
}

private void writeBinary(byte[] buf, int offset, int length) throws TException {
writeVarint32(length);
trans_.write(buf, offset, length);
}

// 其他层复写的方法
public void writeMessageEnd() throws TException {}
public void writeMapEnd() throws TException {}
public void writeListEnd() throws TException {}
public void writeSetEnd() throws TException {}
public void writeFieldEnd() throws TException {}

/**
* Abstract method for writing the start of lists and sets. List and sets on
* the wire differ only by the type indicator.
*/
protected void writeCollectionBegin(byte elemType, int size) throws TException {
if (size <= 14) {
writeByteDirect(size << 4 | getCompactType(elemType));
} else {
writeByteDirect(0xf0 | getCompactType(elemType));
writeVarint32(size);
}
}

/**
* Write an i32 as a varint. Results in 1-5 bytes on the wire.
* TODO: make a permanent buffer like writeVarint64?
*/
byte[] i32buf = new byte[5];
private void writeVarint32(int n) throws TException {
int idx = 0;
while (true) {
if ((n & ~0x7F) == 0) {
i32buf[idx++] = (byte)n;
// writeByteDirect((byte)n);
break;
// return;
} else {
i32buf[idx++] = (byte)((n & 0x7F) | 0x80);
// writeByteDirect((byte)((n & 0x7F) | 0x80));
n >>>= 7;
}
}
trans_.write(i32buf, 0, idx);
}

/**
* Write an i64 as a varint. Results in 1-10 bytes on the wire.
*/
byte[] varint64out = new byte[10];
private void writeVarint64(long n) throws TException {
int idx = 0;
while (true) {
if ((n & ~0x7FL) == 0) {
varint64out[idx++] = (byte)n;
break;
} else {
varint64out[idx++] = ((byte)((n & 0x7F) | 0x80));
n >>>= 7;
}
}
trans_.write(varint64out, 0, idx);
}

/**
* Convert l into a zigzag long. This allows negative numbers to be
* represented compactly as a varint.
*/
private long longToZigzag(long l) {
return (l << 1) ^ (l >> 63);
}

/**
* Convert n into a zigzag int. This allows negative numbers to be
* represented compactly as a varint.
*/
private int intToZigZag(int n) {
return (n << 1) ^ (n >> 31);
}

/**
* Convert a long into little-endian bytes in buf starting at off and going
* until off+7.
*/
private void fixedLongToBytes(long n, byte[] buf, int off) {
buf[off+0] = (byte)( n & 0xff);
buf[off+1] = (byte)((n >> 8 ) & 0xff);
buf[off+2] = (byte)((n >> 16) & 0xff);
buf[off+3] = (byte)((n >> 24) & 0xff);
buf[off+4] = (byte)((n >> 32) & 0xff);
buf[off+5] = (byte)((n >> 40) & 0xff);
buf[off+6] = (byte)((n >> 48) & 0xff);
buf[off+7] = (byte)((n >> 56) & 0xff);
}

/**
* Writes a byte without any possibility of all that field header nonsense.
* Used internally by other writing methods that know they need to write a byte.
*/
private byte[] byteDirectBuffer = new byte[1];
private void writeByteDirect(byte b) throws TException {
byteDirectBuffer[0] = b;
trans_.write(byteDirectBuffer);
}

/**
* Writes a byte without any possibility of all that field header nonsense.
*/
private void writeByteDirect(int n) throws TException {
writeByteDirect((byte)n);
}


//
// Reading methods.
//

/**
* Read a message header.
*/
public TMessage readMessageBegin() throws TException {
byte protocolId = readByte();
if (protocolId != PROTOCOL_ID) {
throw new TProtocolException("Expected protocol id " + Integer.toHexString(PROTOCOL_ID) + " but got " + Integer.toHexString(protocolId));
}
byte versionAndType = readByte();
byte version = (byte)(versionAndType & VERSION_MASK);
if (version != VERSION) {
throw new TProtocolException("Expected version " + VERSION + " but got " + version);
}
byte type = (byte)((versionAndType >> TYPE_SHIFT_AMOUNT) & 0x03);
int seqid = readVarint32();
String messageName = readString();
return new TMessage(messageName, type, seqid);
}

/**
* Read a struct begin. There's nothing on the wire for this, but it is our
* opportunity to push a new struct begin marker onto the field stack.
*/
public TStruct readStructBegin() throws TException {
lastField_.push(lastFieldId_);
lastFieldId_ = 0;
return ANONYMOUS_STRUCT;
}

/**
* Doesn't actually consume any wire data, just removes the last field for
* this struct from the field stack.
*/
public void readStructEnd() throws TException {
// consume the last field we read off the wire.
lastFieldId_ = lastField_.pop();
}

/**
* Read a field header off the wire.
*/
public TField readFieldBegin() throws TException {
byte type = readByte();

// if it's a stop, then we can return immediately, as the struct is over.
if (type == TType.STOP) {
return TSTOP;
}

short fieldId;

// mask off the 4 MSB of the type header. it could contain a field id delta.
short modifier = (short)((type & 0xf0) >> 4);
if (modifier == 0) {
// not a delta. look ahead for the zigzag varint field id.
fieldId = readI16();
} else {
// has a delta. add the delta to the last read field id.
fieldId = (short)(lastFieldId_ + modifier);
}

TField field = new TField("", getTType((byte)(type & 0x0f)), fieldId);

// if this happens to be a boolean field, the value is encoded in the type
if (isBoolType(type)) {
// save the boolean value in a special instance variable.
boolValue_ = (byte)(type & 0x0f) == Types.BOOLEAN_TRUE ? Boolean.TRUE : Boolean.FALSE;
}

// push the new field onto the field stack so we can keep the deltas going.
lastFieldId_ = field.id;
return field;
}

/**
* Read a map header off the wire. If the size is zero, skip reading the key
* and value type. This means that 0-length maps will yield TMaps without the
* "correct" types.
*/
public TMap readMapBegin() throws TException {
int size = readVarint32();
byte keyAndValueType = size == 0 ? 0 : readByte();
return new TMap(getTType((byte)(keyAndValueType >> 4)), getTType((byte)(keyAndValueType & 0xf)), size);
}

/**
* Read a list header off the wire. If the list size is 0-14, the size will
* be packed into the element type header. If it's a longer list, the 4 MSB
* of the element type header will be 0xF, and a varint will follow with the
* true size.
*/
public TList readListBegin() throws TException {
byte size_and_type = readByte();
int size = (size_and_type >> 4) & 0x0f;
if (size == 15) {
size = readVarint32();
}
byte type = getTType(size_and_type);
return new TList(type, size);
}

/**
* Read a set header off the wire. If the set size is 0-14, the size will
* be packed into the element type header. If it's a longer set, the 4 MSB
* of the element type header will be 0xF, and a varint will follow with the
* true size.
*/
public TSet readSetBegin() throws TException {
return new TSet(readListBegin());
}

/**
* Read a boolean off the wire. If this is a boolean field, the value should
* already have been read during readFieldBegin, so we'll just consume the
* pre-stored value. Otherwise, read a byte.
*/
public boolean readBool() throws TException {
if (boolValue_ != null) {
boolean result = boolValue_.booleanValue();
boolValue_ = null;
return result;
}
return readByte() == Types.BOOLEAN_TRUE;
}

byte[] byteRawBuf = new byte[1];
/**
* Read a single byte off the wire. Nothing interesting here.
*/
public byte readByte() throws TException {
byte b;
if (trans_.getBytesRemainingInBuffer() > 0) {
b = trans_.getBuffer()[trans_.getBufferPosition()];
trans_.consumeBuffer(1);
} else {
trans_.readAll(byteRawBuf, 0, 1);
b = byteRawBuf[0];
}
return b;
}

/**
* Read an i16 from the wire as a zigzag varint.
*/
public short readI16() throws TException {
return (short)zigzagToInt(readVarint32());
}

/**
* Read an i32 from the wire as a zigzag varint.
*/
public int readI32() throws TException {
return zigzagToInt(readVarint32());
}

/**
* Read an i64 from the wire as a zigzag varint.
*/
public long readI64() throws TException {
return zigzagToLong(readVarint64());
}

/**
* No magic here - just read a double off the wire.
*/
public double readDouble() throws TException {
byte[] longBits = new byte[8];
trans_.readAll(longBits, 0, 8);
return Double.longBitsToDouble(bytesToLong(longBits));
}

/**
* Reads a byte[] (via readBinary), and then UTF-8 decodes it.
*/
public String readString() throws TException {
int length = readVarint32();

if (length == 0) {
return "";
}

try {
if (trans_.getBytesRemainingInBuffer() >= length) {
String str = new String(trans_.getBuffer(), trans_.getBufferPosition(), length, "UTF-8");
trans_.consumeBuffer(length);
return str;
} else {
return new String(readBinary(length), "UTF-8");
}
} catch (UnsupportedEncodingException e) {
throw new TException("UTF-8 not supported!");
}
}

/**
* Read a byte[] from the wire.
*/
public ByteBuffer readBinary() throws TException {
int length = readVarint32();
if (length == 0) return ByteBuffer.wrap(new byte[0]);

byte[] buf = new byte[length];
trans_.readAll(buf, 0, length);
return ByteBuffer.wrap(buf);
}

/**
* Read a byte[] of a known length from the wire.
*/
private byte[] readBinary(int length) throws TException {
if (length == 0) return new byte[0];

byte[] buf = new byte[length];
trans_.readAll(buf, 0, length);
return buf;
}

//
// These methods are here for the struct to call, but don't have any wire
// encoding.
//
public void readMessageEnd() throws TException {}
public void readFieldEnd() throws TException {}
public void readMapEnd() throws TException {}
public void readListEnd() throws TException {}
public void readSetEnd() throws TException {}

//
// Internal reading methods
//

/**
* Read an i32 from the wire as a varint. The MSB of each byte is set
* if there is another byte to follow. This can read up to 5 bytes.
*/
private int readVarint32() throws TException {
int result = 0;
int shift = 0;
if (trans_.getBytesRemainingInBuffer() >= 5) {
byte[] buf = trans_.getBuffer();
int pos = trans_.getBufferPosition();
int off = 0;
while (true) {
byte b = buf[pos+off];
result |= (int) (b & 0x7f) << shift;
if ((b & 0x80) != 0x80) break;
shift += 7;
off++;
}
trans_.consumeBuffer(off+1);
} else {
while (true) {
byte b = readByte();
result |= (int) (b & 0x7f) << shift;
if ((b & 0x80) != 0x80) break;
shift += 7;
}
}
return result;
}

/**
* Read an i64 from the wire as a proper varint. The MSB of each byte is set
* if there is another byte to follow. This can read up to 10 bytes.
*/
private long readVarint64() throws TException {
int shift = 0;
long result = 0;
if (trans_.getBytesRemainingInBuffer() >= 10) {
byte[] buf = trans_.getBuffer();
int pos = trans_.getBufferPosition();
int off = 0;
while (true) {
byte b = buf[pos+off];
result |= (long) (b & 0x7f) << shift;
if ((b & 0x80) != 0x80) break;
shift += 7;
off++;
}
trans_.consumeBuffer(off+1);
} else {
while (true) {
byte b = readByte();
result |= (long) (b & 0x7f) << shift;
if ((b & 0x80) != 0x80) break;
shift +=7;
}
}
return result;
}

//
// encoding helpers
//

/**
* Convert from zigzag int to int.
*/
private int zigzagToInt(int n) {
return (n >>> 1) ^ -(n & 1);
}

/**
* Convert from zigzag long to long.
*/
private long zigzagToLong(long n) {
return (n >>> 1) ^ -(n & 1);
}

/**
* Note that it's important that the mask bytes are long literals,
* otherwise they'll default to ints, and when you shift an int left 56 bits,
* you just get a messed up int.
*/
private long bytesToLong(byte[] bytes) {
return
((bytes[7] & 0xffL) << 56) |
((bytes[6] & 0xffL) << 48) |
((bytes[5] & 0xffL) << 40) |
((bytes[4] & 0xffL) << 32) |
((bytes[3] & 0xffL) << 24) |
((bytes[2] & 0xffL) << 16) |
((bytes[1] & 0xffL) << 8) |
((bytes[0] & 0xffL));
}

//
// type testing and converting
//

private boolean isBoolType(byte b) {
int lowerNibble = b & 0x0f;
return lowerNibble == Types.BOOLEAN_TRUE || lowerNibble == Types.BOOLEAN_FALSE;
}

/**
* Given a TCompactProtocol.Types constant, convert it to its corresponding
* TType value.
*/
private byte getTType(byte type) throws TProtocolException {
switch ((byte)(type & 0x0f)) {
case TType.STOP:
return TType.STOP;
case Types.BOOLEAN_FALSE:
case Types.BOOLEAN_TRUE:
return TType.BOOL;
case Types.BYTE:
return TType.BYTE;
case Types.I16:
return TType.I16;
case Types.I32:
return TType.I32;
case Types.I64:
return TType.I64;
case Types.DOUBLE:
return TType.DOUBLE;
case Types.BINARY:
return TType.STRING;
case Types.LIST:
return TType.LIST;
case Types.SET:
return TType.SET;
case Types.MAP:
return TType.MAP;
case Types.STRUCT:
return TType.STRUCT;
default:
throw new TProtocolException("don't know what type: " + (byte)(type & 0x0f));
}
}

/**
* Given a TType value, find the appropriate TCompactProtocol.Types constant.
*/
private byte getCompactType(byte ttype) {
return ttypeToCompactType[ttype];
}
}

写入方式

这里我们需要讲几个具体的例子

写bool类型。

由于bool类型只有是否两种状态,且在我们的生成的代码当中是以TField的形式存在的,所有对于boolean的写入特殊处理。
写入bool类型的在struct 生成代码当中的的样例

TCompactProtocol写入Boolean分两种情况,

  • 1:该boolean值为TStruct中的内部成员时TField时,得写入header数据(即内容和数据类型压缩在一起写);
  • 2 :如果不为TField内部类型的话,直接按byte写入
1
2
oprot.writeFieldBegin(BOOLTYPE_FIELD_DESC);  //  BOOLTYPE_FIELD_DESC type = 
oprot.writeBool(struct.boolType);

所以在下面用booleanField_ 先保存了TField 然后在后续执行 writeBool 判断一下booleanField_ 是否为空

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
public void writeFieldBegin(TField field) throws TException {
if (field.type == TType.BOOL) {
booleanField_ = field;
} else {
writeFieldBeginInternal(field, (byte)-1);
}
}

public void writeBool(boolean b) throws TException {
if (booleanField_ != null) {
// we haven't written the field header yet
writeFieldBeginInternal(booleanField_, b ? Types.BOOLEAN_TRUE : Types.BOOLEAN_FALSE);
booleanField_ = null;
} else {
// we're not part of a field, so just write the value.
writeByteDirect(b ? Types.BOOLEAN_TRUE : Types.BOOLEAN_FALSE);
}
}

writeFieldBeginInternal 压缩写法

writeFieldBeginInternal 每次写入一个Field 就说做一个记录,如果连续写的Field 不超过16个,那么他会将type 压缩写入。
具体的压缩过程。
第一,算出 Field的标号
第二, 把编号的低4为空余出来
第三,将type类型和Field的第四位拼起来。

比如:

  • field.id > this.lastFieldId_ = 3 即 为 0000 0011
  • << 4 = 0011 0000
  • 找到field type this.getCompactType(field.type)
  • | 上 type 比如说Type 为 1
  • 最终结果 0011 0001 写入

代码如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
static {
ttypeToCompactType[TType.STOP] = TType.STOP;
ttypeToCompactType[TType.BOOL] = Types.BOOLEAN_TRUE;
ttypeToCompactType[TType.BYTE] = Types.BYTE;
ttypeToCompactType[TType.I16] = Types.I16;
ttypeToCompactType[TType.I32] = Types.I32;
ttypeToCompactType[TType.I64] = Types.I64;
ttypeToCompactType[TType.DOUBLE] = Types.DOUBLE;
ttypeToCompactType[TType.STRING] = Types.BINARY;
ttypeToCompactType[TType.LIST] = Types.LIST;
ttypeToCompactType[TType.SET] = Types.SET;
ttypeToCompactType[TType.MAP] = Types.MAP;
ttypeToCompactType[TType.STRUCT] = Types.STRUCT;
}
private void writeFieldBeginInternal(TField field, byte typeOverride) throws TException {
// ttypeToCompactType[ttype]
byte typeToWrite = typeOverride == -1 ? this.getCompactType(field.type) : typeOverride;
// 得到 差值 这个差值在 0 - 15 一个byte 就可。
if (field.id > this.lastFieldId_ && field.id - this.lastFieldId_ <= 15) {
// 然后 << 4 表示 把地四位 空余出来,
this.writeByteDirect(field.id - this.lastFieldId_ << 4 | typeToWrite);
} else {
this.writeByteDirect(typeToWrite);
this.writeI16(field.id);
}

this.lastFieldId_ = field.id;
}

其他格式

  • bool类型:
    一个字节。

如果bool型的字段是结构体或消息的成员字段并且有编号,一个字节的高4位表示字段编号,低4位表示bool的值(0001:true, 0010:false),即:一个字节的低4位的值(true:1,false:2).

如果bool型的字段单独存在,一个字节表示值,即:一个字节的值(true:1,false:2).

  • Byte类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一个字节的值.

  • I16类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一至三个字节的值.

  • I32类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一至五个字节的值.

  • I64类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一至十个字节的值.

  • double类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),八个字节的值.注:把double类型的数据转成八字节保存,并用小端方式发送。

  • String类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一至五个字节的负载数据的长度,负载数据.

  • Struct类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),结构体负载数据,一个字节的结束标记.

  • MAP类型:

一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一至五个字节的map元素的个数,一个字节的键值类型组合(高4位键类型,低4位值类型),Map负载数据.

  • Set类型:

表示方式一:一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一个字节的元素个数和值类型组合(高4位键元素个数,低4位值类型),Set负载数据.适用于Set中元素个数小于等于14个的情况

表示方式二:一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一个字节的键值类型(高4位全为1,低4位值类型),一至五个字节的map元素的个数,Set负载数据.适用于Set中元素个数大于14个的情况

  • List类型:

表示方式一:一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一个字节的元素个数和值类型组合(高4位键元素个数,低4位值类型),List负载数据.
适用于Set中元素个数小于等于14个的情况。

表示方式二:一个字节的编号与类型组合(高4位编号偏移1,低4位类型),一个字节的键值类型(高4位全为1,低4位值类型),一至五个字节的map元素的个数,List负载数据.适用于Set中元素个数大于14个的情况。

  • 消息(函数)类型:

一个字节的版本,一个字节的消息调用(请求:0x21,响应:0x41,异常:0x61,oneway:0x81),一至五个字节的消息名称长度,消息名称,消息参数负载数据,一个字节的结束标记。

以上说明是基于相邻字段的编号小于等于15的情况。

如果字段相邻编号大于15,需要把类型和编号分开表示:用一个字节表示类型,一至五个字节表示编号偏移值。

读取方式

经过代码分析,我们了解了写入的过程,下面来了解读取过程。
我们还是来看 读取Field的例子,Field 会封装在TMessage 当中,所以,首先read会读取TMessage,读取TMessage的没有什么特殊的,就是将TMessage 按照writeMessage 的顺序读取出来,之前会有一个判断版本号的一个操作

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
public TField readFieldBegin() throws TException {
byte type = readByte();
if (type == TType.STOP) {
return TSTOP;
}

short fieldId;

// 主要是计算field的id
// 写入是以高4为未field id的偏移 所以需要将field 的偏移算出来
short modifier = (short)((type & 0xf0) >> 4);
if (modifier == 0) {
//如果否没有偏移 直接 读取16位的field id
fieldId = readI16();
} else {
// 否则 偏移 + 组内编号得到 field整整的id
fieldId = (short)(lastFieldId_ + modifier);
}
// 转换为 type
TField field = new TField("", getTType((byte)(type & 0x0f)), fieldId);

// if this happens to be a boolean field, the value is encoded in the type
if (isBoolType(type)) {
// save the boolean value in a special instance variable.
boolValue_ = (byte)(type & 0x0f) == Types.BOOLEAN_TRUE ? Boolean.TRUE : Boolean.FALSE;
}

//记录field
lastFieldId_ = field.id;
return field;
}

read 方法举例

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
public short readI16() throws TException {
return (short)zigzagToInt(readVarint32());
}

private int readVarint32() throws TException {
int result = 0;
int shift = 0;
// i32buf 是5个字节 也就是说 varint 最大字节为5
if (trans_.getBytesRemainingInBuffer() >= 5) {
byte[] buf = trans_.getBuffer();
int pos = trans_.getBufferPosition();
int off = 0;
while (true) {
// varin32的读法,建议读一下上面的 varint32的写法 这里不再详解
byte b = buf[pos+off];
result |= (int) (b & 0x7f) << shift;
if ((b & 0x80) != 0x80) break;
shift += 7;
off++;
}
trans_.consumeBuffer(off+1);
} else {
while (true) {
byte b = readByte();
result |= (int) (b & 0x7f) << shift;
if ((b & 0x80) != 0x80) break;
shift += 7;
}
}
return result;
}

压缩测试

本节我们来讨论压缩比,上面说了这么多,那么TCompactProtocol 的压缩比如何呢?
首先我们需要 MemeoryUtil 这个工具

  • 1、 编写测试bean 并生成对象
1
2
3
4
5
6
7
8
9
10
11
12
13
struct User {
1: required string name,
2: required i16 age = 0;
3: required bool gender,
4: required i32 No,
5: required i64 createTime,
6: required double grade,
7: required list<Friends> friends,
8: optional map<i32, Friends> mapUser,
9: optional set<Friends> setUser,
10: required UserType userType = UserType.STUDENT,
11: optional i32 number
}
  • 2、 编写测试方法。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
public class CompressionTest {

protected byte[] buf = new byte[1024];
protected TMemoryInputTransport memory ;
protected TByteArrayOutputStream tByteArrayOutputStream ;
protected TFramedTransport tFramedTransport;
protected TCompactProtocol tcprotocol;
protected Class<? extends TFramedTransport> aClass ;
Field writeBuffer_;

@Before
public void preCompactTest() throws Exception {
byte[] buf = new byte[1024];
memory = new TMemoryInputTransport(buf);
/*
利用反射 将写如的对象转换成 TByteArrayOutputStream
这样通过 TCompactProtocol 写入的就是 TByteArrayOutputStream 这个对象当中,方面我们统计
**/
tByteArrayOutputStream = new TByteArrayOutputStream(1024);
tFramedTransport = new TFramedTransport(memory);
tcprotocol = new TCompactProtocol(tFramedTransport);
aClass = tFramedTransport.getClass();

writeBuffer_ = aClass.getDeclaredField("writeBuffer_");
writeBuffer_.setAccessible(true);
writeBuffer_.set(tFramedTransport, tByteArrayOutputStream);

}
// MemoryUtil 是classmexer 当中的方法
@Test
public void testTMessageCompactTest() throws Exception {
final long[] total = {0};
IntStream.range(1, 1000)
.mapToObj(CompressionTest::mockUser)
.forEach(user -> {
try {
user.write(tcprotocol);
long l = MemoryUtil.deepMemoryUsageOf(user, MemoryUtil.VisibilityFilter.ALL);
total[0] +=l;
} catch (TException e) {
e.printStackTrace();
}
});

int len = tByteArrayOutputStream.len();
// 转换
double userTotalDouble = (double) total[0];
double rate = (double) len / userTotalDouble;
String format = String.format("%.4f", rate);
System.out.println("total:" + total[0]);
System.out.println("len:" + len);
System.out.println("rate:" + format);
}
// inet 类型的测试方法
@Test
public void testIntCompactTest() throws Exception {
for (int i = 1; i <= 1000; i++) {
tcprotocol.writeI32(i);
}
int len = tByteArrayOutputStream.len();
double rate = (double) len / (4 * 1000);
System.out.println("rate:" + rate);
System.out.println("len:" + len);
}
// 生成一个 测试对象
public static User mockUser (int no) {
User user = new User();
user.setAge(Short.MAX_VALUE);
user.setCreateTime(System.currentTimeMillis());
user.setGender(true);
user.setName("name" + no);
user.setNo(no);
Friends friends = new Friends();
friends.setNo(Short.MAX_VALUE);
user.setFriends(Collections.singletonList(friends));
return user;
}
  • 3、 将 classmexer.jar 添加到lib 当中去
  • 4、 jvm 参数加上代理 参数: VM options:如下图所示

最后得出结果
User 1000个的压缩结果如下

1
2
3
total:815184
len:42786
rate:0.0525

Int 类型压缩结果如下

1
2
rate:0.48425
len:1937

参考



支付宝打赏 微信打赏

赞赏一下